Abstract


Metric learning aims to embed one metric space into another to benefit tasks like 
classification and clustering. Although a greatly distorted metric space has a high 
degree of freedom to fit training data, it is prone to overfitting and numerical 
inaccuracy. This paper presents bounded-distortion metric learning (BDML), a 
new metric learning framework which amounts to finding an optimal Mahalanobis 
metric space with a bounded-distortion constraint. An efficient solver based on 
the multiplicative weights update method is proposed. Moreover, we generalize 
BDML to pseudo-metric learning and devise the semidefinite relaxation and a 
randomized algorithm to approximately solve it. We further provide theoretical 
analysis to show that distortion is a key ingredient for stability and generalization 
ability of our BDML algorithm. Extensive experiments on several benchmark 
datasets yield promising results. 

1 Introduction 

Distance metric learning is a fundamental problem in machine learning, since many learning
algorithms, e.g., k-nearest neighbors (kNN) and k-means, crucially rely on a “good” metric. The criteria
of good metrics may differ in various learning tasks. For instance, in supervised learning, a common
criterion is to learn a metric with a low empirical error [39], while in unsupervised learning, a good
criterion is to learn a metric that minimizes the intra-cluster distance and simultaneously maximizes 
the inter-cluster distance ED. 

In essence, metric learning aims to search for a metric embedding that converts the original metric space
(e.g., Euclidean) into a new one, which better suits learning tasks with regard to the above criteria.
Such an embedding intrinsically induces distortion - a concept in the theory of metric embedding El. 
which intuitively measures the effort to reshape the metric space. 

Although a highly distorted metric space can have a high degree of freedom to describe data, it may
be prone to overfitting. We validate this intuition theoretically later by proving that,
in our case of Mahalanobis metric learning, the generalization bound grows with the
distortion. Moreover, numerical inaccuracy also becomes a problem if the distortion is extremely
large. We will show that the distortion of Mahalanobis metric learning is the condition number of
the parameter matrix, so a large distortion inevitably makes the matrix ill-conditioned.

In order to balance the fit of the learned metric space to training data against the distortion
of the underlying metric embedding, we present bounded-distortion metric learning (BDML), a
generic framework that imposes a bounded-distortion constraint on the learning objective. While it
fits various metric learning objectives, we concentrate on learning Mahalanobis metric space, which 
leads to a semidefinite programming (SDP) formulation. 
We approach the SDP via a bisection method, which involves solving a sequence of convex feasibility
problems with a fast multiplicative weights update [23]. Moreover, to deal with pseudo-metric
learning, we apply the spectral decomposition to the parameter matrix and jointly learn the
dimension reduction mapping and the metric. We relax the resulting non-convex quadratically constrained
quadratic program (QCQP) to an SDP and obtain an approximate solution by a randomized algorithm.
Theoretical analysis is provided to reveal that distortion has a direct impact on the stability
and generalization ability of a class of metric learning algorithms. Experimental results on several
benchmark datasets demonstrate the usefulness of our BDML.

2 Related Work 

Metric learning algorithms can be categorized according to different criteria, such as
Mahalanobis EC 27 2) and non-Mahalanobis EH 21 HI methods; probabilistic [20], [8] and
non-probabilistic EH 2i methods; unsupervised [34], supervised [42] and semi-supervised ED
methods; and global El and local [13] methods.

Based on the type of constraints, we can also classify them into pairwise and triplet-wise ones.
Pairwise methods ED0 often add constraints to enforce that distances between pairs of dissimilar
points are larger than a given threshold. Representative methods in the triplet group are the
large-margin nearest neighbor [39] and its variants [24]. They exploit local triplet constraints to ensure
that the distance between any point and its different-class neighbour is at least one unit
margin larger than the distance between it and its same-class neighbour. Intuitively, if these triplet
constraints are well satisfied, the empirical loss of kNN would be small.

Metric learning is closely related to metric embedding, an important topic in theoretical computer
science that has played an important role in designing approximation algorithms. One line of research
focuses on how to embed a finite metric space into normed spaces with a low distortion BED,
i.e., preserving the structure of the original metric space. Metric learning is also related to manifold
learning ED and kernel learning ESI. Learning a distance metric function amounts to learning a
kernel function that measures the similarity between points.

3 Bounded-Distortion Metric Learning 

Before going into details, we introduce the necessary notation. $\mathbb{S}^d = \{M \mid M \in \mathbb{R}^{d \times d}, M^\top = M\}$ is
the space of all $d \times d$ real symmetric matrices equipped with the Frobenius inner product
$A \bullet B = \mathrm{Tr}(A^\top B)$. The positive semidefinite (PSD) cone and positive definite (PD) cone are denoted as
$\mathbb{S}^d_{+} = \{M \mid M \in \mathbb{S}^d, M \succeq 0\}$ and $\mathbb{S}^d_{++} = \{M \mid M \in \mathbb{S}^d, M \succ 0\}$ respectively. The convex set
$\mathcal{P}^d_R = \{M \mid M \in \mathbb{S}^d_{++}, \mathrm{Tr}(M) \le R\}$ is also used. The trace bound $R$ in $\mathcal{P}^d_R$ is a parameter to ensure
a bounded domain for $M$. $\mathbb{R}^d_{+}$ denotes the d-dimensional nonnegative orthant.

3.1 Distortion of Metric Embedding 

A definition of metric space is the following. 

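Definition 1. A pair $(\mathcal{X}, d_{\mathcal{X}})$ is called a metric space, where $\mathcal{X}$ is a set of points and
$d_{\mathcal{X}} : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ is a distance function satisfying the following conditions for all $x_i, x_j, x_k \in \mathcal{X}$:

• $d_{\mathcal{X}}(x_i, x_j) = 0$ iff $x_i = x_j$,

• $d_{\mathcal{X}}(x_i, x_j) = d_{\mathcal{X}}(x_j, x_i)$,

• $d_{\mathcal{X}}(x_i, x_j) + d_{\mathcal{X}}(x_j, x_k) \ge d_{\mathcal{X}}(x_i, x_k)$.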

A Mahalanobis metric space is a metric space equipped with a Mahalanobis distance function, which
often takes the form $d_{\mathcal{X}}(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}$, parameterized by a PD matrix $M$,
i.e., $M \in \mathbb{S}^d_{++}$. In the metric learning literature, a PSD $M$ is usually adopted, so the induced
distance function is, strictly speaking, a pseudo-metric. We now focus on the PD case and defer the
PSD one to Sect. 5. Obviously, the Euclidean space is a special Mahalanobis metric space where
$M$ is the identity matrix. Note that we work with the squared Mahalanobis distance, since squaring does not
affect learning methods (e.g., kNN) that are based on relative distances.
Figure 1: Ellipsoids with various condition numbers, where (a) and (c) have the same logDet value
≈ 3.5; (b) and (c) have the same F-norm value ≈ 7.2.


We can embed one metric space into another with a certain degree of distortion. The formal
definitions of metric embedding and its distortion are as follows mm.

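Definition 2. Let $(\mathcal{X}, d_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}})$ be two metric spaces. A mapping $f : \mathcal{X} \to \mathcal{Y}$ is said to be a
c-embedding if there exists $r > 0$ such that for all $x, y \in \mathcal{X}$,

$$r \cdot d_{\mathcal{X}}(x, y) \le d_{\mathcal{Y}}(f(x), f(y)) \le c\, r \cdot d_{\mathcal{X}}(x, y). \qquad (1)$$

The distortion of $f$ is defined as the infimum of all $c$ such that $f$ is a c-embedding.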

Distortion is a measure of the distance between two metric spaces and plays an important role in the 
theory of metric embeddings. Later we will show that distortion is essential to stabilize a class of 
metric learning algorithms. 

In Mahalanobis metric learning, given a Euclidean metric space $(\mathcal{X}, d_I)$, we learn a metric
embedding $f_{I \to M}$, which returns a desired Mahalanobis metric space $(\mathcal{X}, d_M)$. We have the following
proposition to specify the distortion of this metric embedding.

Proposition 1. The distortion of the metric embedding $f_{I \to M}$ is the condition number $\kappa(M)$.
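
Proposition 1 can be sanity-checked numerically. The following is only an illustrative sketch, not the proof deferred to the appendix: it assumes the squared Mahalanobis distance used in this paper, and the helper name `empirical_distortion` is hypothetical. It samples difference vectors, computes the ratio between the new and the old squared distance, and compares the spread of that ratio with the condition number.

```python
import numpy as np

def empirical_distortion(M, num_samples=20000, seed=0):
    """Estimate the distortion of the identity map from the Euclidean metric to
    the squared Mahalanobis distance induced by M (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    # Add the eigenvectors of M so that the extreme ratios are attained exactly.
    _, eigvecs = np.linalg.eigh(M)
    diffs = np.vstack([rng.standard_normal((num_samples, d)), eigvecs.T])
    euclid = np.einsum("nd,nd->n", diffs, diffs)        # squared Euclidean distances
    mahal = np.einsum("nd,de,ne->n", diffs, M, diffs)   # squared Mahalanobis distances
    ratios = mahal / euclid
    # With r = min ratio, the smallest valid c in Eq. (1) is max ratio / min ratio.
    return ratios.max() / ratios.min()

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
M = A @ A.T + 0.5 * np.eye(5)       # a random positive definite matrix
distortion = empirical_distortion(M)
kappa = np.linalg.cond(M)           # ratio of largest to smallest eigenvalue for PD M
```

The ratio of the two squared distances is a Rayleigh quotient of $M$, so it ranges over the interval between the smallest and the largest eigenvalue, which is why the two quantities coincide.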

Due to the page limit, we focus on presenting important results in this paper and defer all proofs to 
the appendix in the supplementary file. 

3.2 Geometric Meaning of Distortion 

Distortion can be intuitively regarded as a complexity measure of metric embedding. From a
geometric perspective, it possesses an intrinsically different meaning compared to other complexity
measures in previous work, including the log determinant (logDet) 0 and Frobenius norm
(F-norm) ED. Specifically, we focus on analyzing the metric embedding $f_{I \to M}$ and consider an
origin-centered ellipsoid $\mathcal{E} = \{x \in \mathbb{R}^d \mid x^\top M x \le 1\}$ for simplicity.

Let $\{\lambda_i\}_{i=1}^{d}$ be the set of eigenvalues of $M$. It is well-known that the logarithmic volume of $\mathcal{E}$ is
$\log(V(\mathcal{E})) = \log(\gamma) - \frac{1}{2}\mathrm{logDet}(M)$, where $\gamma$ is the volume of the unit ball in $\mathbb{R}^d$. The squared
F-norm of $M$ is $\|M\|_F^2 = \sum_{i=1}^{d} 1/r_i^4$, where $r_i = 1/\sqrt{\lambda_i}$ is the length of the i-th
semi-axis. The condition number is $\kappa(M) = r_{\max}^2 / r_{\min}^2$. In other words, logDet measures the
volume variation; the F-norm indicates the change of the overall lengths of the semi-axes; while the condition
number describes the variation of the length ratio between the longest and the shortest semi-axis. As
illustrated in Fig. 1, it is possible that a metric with a small logDet or F-norm value is ill-conditioned,
i.e., the ellipsoid is extremely elongated. Therefore, rather than focusing on the absolute variation,
distortion measures the relative one, thus allowing more freedom while directly controlling
well-conditioning.
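
The contrast between the three measures can be illustrated with a small numerical sketch. The matrices below are toy examples chosen for this illustration (they are not the ellipsoids of Fig. 1), and `complexity_measures` is a hypothetical helper.

```python
import numpy as np

def complexity_measures(M):
    """Return logDet, Frobenius norm and condition number of a PD matrix M."""
    eigvals = np.linalg.eigvalsh(M)
    return {
        "logdet": float(np.sum(np.log(eigvals))),
        "fnorm": float(np.linalg.norm(M, "fro")),
        "cond": float(eigvals.max() / eigvals.min()),
    }

round_metric = np.diag([1.0, 1.0])           # a circle: perfectly conditioned
same_logdet = np.diag([100.0, 0.01])         # same determinant, extremely elongated
same_fnorm = np.diag([1.4, 0.2])             # same F-norm (sqrt(2)), elongated

stats_round = complexity_measures(round_metric)
stats_logdet = complexity_measures(same_logdet)
stats_fnorm = complexity_measures(same_fnorm)
```

Here `same_logdet` matches `round_metric` in logDet and `same_fnorm` matches it in F-norm, yet their condition numbers are far larger, so neither measure alone rules out an ill-conditioned metric.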

3.3 Pair and Triplet Constrained BDML 

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is a data point and $y_i$ is the
corresponding class label, our task is to obtain a Mahalanobis distance metric space $(\mathcal{X}, d_M)$, where $d_M$ is the
distance function defined as $d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) = M \bullet X_{ij}$, and $M \in \mathbb{S}^d_{++}$.
Here we define $X_{ij} = (x_i - x_j)(x_i - x_j)^\top$ for notational simplicity.
Algorithm 1: A Bisection Method

1: Given the interval of g* as [L, U], tolerance ε > 0.
2: Repeat
3:   g = (L + U)/2.
4:   Solve the convex feasibility problem (4).
5:   If problem (4) is feasible: U = g.
6:   Else: L = g.
7: Until U − L < ε.
8: Return the final objective value as g.
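
The control flow of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: `is_feasible` is a hypothetical callback standing in for a solver of the convex feasibility problem (4), and the sketch returns the last feasible level `upper` rather than the midpoint.

```python
def bisect_objective(is_feasible, lower, upper, tol=1e-6):
    """Shrink [lower, upper] around the smallest feasible objective level.

    `is_feasible(g)` is a hypothetical stand-in for the convex feasibility
    problem: it returns True when some M attains an objective value <= g.
    """
    while upper - lower >= tol:
        g = 0.5 * (lower + upper)
        if is_feasible(g):
            upper = g      # the optimum is at most g
        else:
            lower = g      # the optimum exceeds g
    return upper

# Toy stand-in: pretend the optimal value is 0.37, so a level g is feasible iff g >= 0.37.
estimate = bisect_objective(lambda g: g >= 0.37, lower=0.0, upper=1.0, tol=1e-8)
```

Each iteration halves the interval, so the number of feasibility problems grows only logarithmically in (U − L)/ε.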


We now present two notable formulations of our bounded-distortion metric learning (BDML), which
correspond to two types of constraints in the metric learning literature, i.e., pairwise ones and
triplet-wise ones. Specifically, for any data point $x_i$, we consider its k-nearest neighbour set $\Omega_i$.
Following [39], two types of neighbor points are distinguished: target neighbors, which share
the same class label with $x_i$, and impostor neighbors, which have a different class label from $x_i$.

Let $\mathcal{S} = \{(i, j) \mid x_j \in \Omega_i, y_j = y_i\}$ and $\mathcal{I} = \{(i, j) \mid x_j \in \Omega_i, y_j \ne y_i\}$ be the sets of all index
pairs of target neighbors and impostor neighbors respectively. $\mathcal{T} = \{(i, j, k) \mid (i, j) \in \mathcal{S}, (i, k) \in \mathcal{I}\}$
denotes the set of all such index triplets.
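
The index sets can be built directly from their definitions. The sketch below is illustrative only: it assumes that the neighbour sets $\Omega_i$ are computed under the Euclidean metric, and the function name `build_constraint_sets` is hypothetical.

```python
import numpy as np

def build_constraint_sets(X, y, k):
    """Return (S, I, T) as lists of index tuples for k-nearest-neighbour sets."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)             # a point is not its own neighbour
    S, I = [], []
    for i in range(n):
        neighbours = np.argsort(sq_dists[i])[:k]   # Omega_i (Euclidean, by assumption)
        for j in neighbours:
            # Same label: target neighbour; different label: impostor neighbour.
            (S if y[j] == y[i] else I).append((i, int(j)))
    impostors = {}
    for i, kk in I:
        impostors.setdefault(i, []).append(kk)
    # Pair every target neighbour of i with every impostor neighbour of i.
    T = [(i, j, kk) for i, j in S for kk in impostors.get(i, [])]
    return S, I, T

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
y = np.array([0, 0, 1, 0])
S, I, T = build_constraint_sets(X, y, k=2)
```

In this toy example, point 2 has no target neighbour, so it contributes impostor pairs but no triplets.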

We define the pair-constrained bounded-distortion metric learning (p-BDML) as

$$\min_{M \in \mathcal{P}^d_R} \ \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} M \bullet X_{ij} \quad \text{s.t.} \quad M \bullet X_{ij} \ge \mu, \ \forall (i,j) \in \mathcal{I}, \quad \kappa(M) \le K, \qquad (2)$$

where $n = |\mathcal{S}|$, $\mu$ is the margin and $K$ is a parameter that controls the upper bound of the distortion. In the
literature, pairwise constraints are designed either to push two impostor neighbors farther apart than a margin or
to pull two target neighbors closer than a margin. Here we only consider the former and minimize the average
distance of target neighbors as in fiTj .

The triplet-constrained bounded-distortion metric learning (t-BDML) is formulated as

$$\min_{M \in \mathcal{P}^d_R} \ \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} M \bullet X_{ij} \quad \text{s.t.} \quad M \bullet X_{ik} - M \bullet X_{ij} \ge \mu, \ \forall (i,j,k) \in \mathcal{T}, \quad \kappa(M) \le K. \qquad (3)$$

The above inequality constraint on triplets of nearest neighbors ensures that any given point has its
impostor neighbors at least a margin farther away than its target neighbors. Note that the unit margin
in [39] can be replaced by an arbitrary positive constant, since it only affects the scale of $M$. In our case,
since we consider a bounded domain of $M$ (i.e., $M \in \mathcal{P}^d_R$), the margin $\mu$ is treated as a positive
parameter.

Note that the bounded-distortion constraint $\kappa(M) \le K$ implies that $M$ should be PD, since otherwise
$\kappa(M)$ is unbounded. Although nonconvex, the condition number function is quasi-convex, which means that
all its sublevel sets are convex. This property enables us to transform p-BDML and t-BDML into the
standard formulation of SDP.


4 A Bisection Algorithm with Multiplicative Weights Update 

We now present a bisection algorithm for solving our p-BDML and t-BDML, which essentially
solves a sequence of convex feasibility problems. For each feasibility problem, we resort to the
multiplicative weights update (MWU) method Esin, which is a meta-algorithm with many
variants in different disciplines. We choose MWU because it generates an approximate
solution with a guaranteed bound on constraint violation, which is important for the analysis in Sec. 6 to hold.
4.1 Sequential Convex Feasibility Problems


In what follows, we only describe the convex feasibility problem for p-BDML, since the formulation
for t-BDML differs only in the constraints. We denote the objective function as $g(M) = G \bullet M$, where
$G = \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} X_{ij}$. Its optimal value $g^*$ is assumed to lie in the initial interval $[L, U]$, where $L$
and $U$ are set to 0 and $g(I)$ respectively, and $I$ is the identity matrix. Our bisection algorithm maintains an
estimate $g$ of $g^*$ and halves the interval in each iteration. The procedure of the bisection algorithm
is outlined in Alg. 1.

Specifically, to check whether $g^*$ is not larger than the current estimate $g$ in one iteration, we solve a convex
feasibility problem as

$$\text{find } M \in \mathcal{P}^d_R, \ a > 0 \quad \text{s.t.} \quad G \bullet M \le g, \quad M \bullet X_{ij} \ge \mu, \ \forall (i,j) \in \mathcal{I}, \quad aI \preceq M \preceq aKI. \qquad (4)$$

Here we introduce a positive auxiliary variable $a$ and transform the bounded-distortion constraint
into two generalized inequality constraints. The resulting convex feasibility problem can be approximately
solved by the efficient MWU method, which is elaborated in the next section.
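
The equivalence between the condition number bound and the two generalized inequalities can be checked numerically. The sketch below is an illustration rather than part of the solver; `lmi_certificate` is a hypothetical helper that returns a valid auxiliary scalar $a$ when one exists.

```python
import numpy as np

def lmi_certificate(M, K):
    """Return a > 0 with a*I <= M <= a*K*I in the PSD order, or None if none exists."""
    eigvals = np.linalg.eigvalsh(M)          # ascending order
    lam_min, lam_max = eigvals[0], eigvals[-1]
    if lam_min <= 0:
        return None                          # M is not positive definite
    # Any a in [lam_max / K, lam_min] works; the interval is nonempty iff kappa(M) <= K.
    a = lam_min
    return float(a) if lam_max <= a * K * (1 + 1e-12) else None

M_good = np.diag([1.0, 2.0, 4.0])     # condition number 4
M_bad = np.diag([1.0, 2.0, 40.0])     # condition number 40
cert_good = lmi_certificate(M_good, K=5.0)
cert_bad = lmi_certificate(M_bad, K=5.0)
```

Because both inequalities are linear in the pair $(M, a)$, the feasible set is convex even though $\kappa(M)$ itself is not a convex function.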

4.2 Multiplicative Weights Update Method 

Before applying the MWU method, we reformulate the feasibility problem (4) into a general form by
introducing slack variables $M_1 = M - aI$ and $M_2 = aKI - M$. Then we construct a sparse
symmetric matrix $Y \in \mathcal{P}^{3d+1}_R$ whose diagonal blocks are $M$, $M_1$, $M_2$ and $a$. All
constraints in (4) except $Y \in \mathcal{P}^{3d+1}_R$ are rewritten as $J_i \bullet Y \ge h_i$. Membership in $\mathcal{P}^{3d+1}_R$ can be deemed an easy
constraint, in contrast to each hard constraint $J_i \bullet Y \ge h_i$. With this change, we obtain the equivalent
formulation of (4) as

$$\text{find } Y \in \mathcal{P}^{3d+1}_R \quad \text{s.t.} \quad J_i \bullet Y \ge h_i, \ \forall i = 1, \ldots, m. \qquad (5)$$

The number of constraints is $m = |\mathcal{I}| + 4d^2 + 2$. We also introduce a closely related feasibility
problem as

$$\text{find } Y \in \mathcal{P}^{3d+1}_R \quad \text{s.t.} \quad \sum_{i=1}^{m} p_i (J_i \bullet Y - h_i) \ge 0. \qquad (6)$$

Here $p = [p_1, \ldots, p_m]^\top$ is a probability vector, i.e., $\forall i, p_i \ge 0$ and $\sum_{i=1}^{m} p_i = 1$. Note that this
problem contains only 2 constraints and is easy to solve. The relationship between problems (5) and (6) is
summarized by the following lemma.

Lemma 1. If problem (5) has a feasible solution $Y^*$, then, given any probability vector $p$, $Y^*$ is feasible
for problem (6). Equivalently, if there exists a probability vector $p$ such that problem (6) is infeasible, then
problem (5) is infeasible.

With the lemma, we now describe the MWU method for solving problem (5). Basically, MWU
maintains a weight vector $w \in \mathbb{R}^m_{+}$, where each entry $w_i$ represents the importance of the i-th
constraint. It iteratively solves problem (6) and updates the weights $w$ according to the constraint
satisfaction $J_i \bullet Y - h_i$. Intuitively, the more a constraint is satisfied, the less important it
is, so we decrease its weight.

In the t-th round of the algorithm, we get a probability vector $p^{(t)}$ by normalizing the
nonnegative weights $w^{(t)}$. Then we solve the 2-constraint feasibility problem (6) by maximizing
$\sum_{i=1}^{m} p_i^{(t)} (J_i \bullet Y - h_i)$ over $\mathcal{P}^{3d+1}_R$. If the maximum value is greater than 0, we take the
corresponding $Y^{(t)}$ as a feasible solution. Otherwise (6) is infeasible. We call the solver of problem (6)
the ORACLE.

Implementing the ORACLE requires computing the eigenvector associated with the largest eigenvalue of the matrix
$C = \sum_{i=1}^{m} p_i [J_i - (h_i / R) I]$, which can be efficiently handled by the Lanczos algorithm. Here $R$ is the
trace bound parameter in $\mathcal{P}^{3d+1}_R$.
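
A possible ORACLE can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the search is restricted to PSD matrices with trace exactly $R$ (so that the maximum is attained at a scaled rank-one matrix), a dense eigendecomposition replaces the Lanczos solver, and the function name `oracle` and its array layout are hypothetical.

```python
import numpy as np

def oracle(J, h, p, R):
    """Maximize sum_i p_i (J_i . Y - h_i) over {Y PSD, Tr(Y) = R}.

    J: array of shape (m, n, n) with symmetric slices; h: shape (m,);
    p: probability vector of shape (m,). Returns (Y, value); the aggregated
    constraint (6) can be satisfied on this set iff value >= 0.
    """
    n = J.shape[1]
    # C = sum_i p_i [J_i - (h_i / R) I], so that C . Y equals the weighted slack.
    C = np.einsum("i,ijk->jk", p, J) - (p @ h / R) * np.eye(n)
    eigvals, eigvecs = np.linalg.eigh(C)     # dense stand-in for a Lanczos solver
    v = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    Y = R * np.outer(v, v)
    return Y, float(R * eigvals[-1])

J = np.array([np.diag([1.0, 0.0]), np.diag([0.0, 1.0])])
h = np.array([0.2, 0.2])
p = np.array([0.75, 0.25])
Y, value = oracle(J, h, p, R=1.0)
direct = float(sum(p[i] * (np.sum(J[i] * Y) - h[i]) for i in range(2)))
```

For large $d$ only the leading eigenpair is needed, which is where an iterative method such as Lanczos pays off.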
Algorithm 2: Multiplicative Weights Update Method

1: Initialization: Fix ε < 1/2; for each constraint, associate the weight $w_i^{(1)} = 1$.
2: For t = 1, 2, ..., T:
3:   Normalize $w^{(t)}$ to get the probability vector $p^{(t)}$.
4:   Call the ORACLE with $p^{(t)}$.
5:   If the ORACLE succeeds in finding a solution $Y^{(t)}$:
6:     $m_i^{(t)} = \frac{1}{\rho} [J_i \bullet Y^{(t)} - h_i]$.
7:     $w_i^{(t+1)} = w_i^{(t)} (1 - \epsilon m_i^{(t)})$.
8:   Else
9:     Return that the problem is infeasible.
10:  End
11: End
12: Return $\bar{Y} = (\sum_{t=1}^{T} Y^{(t)}) / T$ as the final solution.
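
The loop of Algorithm 2 can be sketched on a toy problem. The code is an illustration under assumptions rather than the authors' solver: the easy constraint is taken to be the set of PSD matrices with trace exactly $R$, the ORACLE uses a dense eigendecomposition, and the function name and the tiny example constraints are hypothetical.

```python
import numpy as np

def mwu_feasibility(J, h, R, rho, eps=0.05, rounds=2000):
    """Sketch of MWU for: find Y PSD, Tr(Y) = R, with J_i . Y >= h_i for all i.

    rho is the width parameter: |J_i . Y - h_i| <= rho for every oracle answer.
    Returns the averaged solution, or None if an aggregated problem is infeasible.
    """
    m, n, _ = J.shape
    w = np.ones(m)
    Y_sum = np.zeros((n, n))
    for _ in range(rounds):
        p = w / w.sum()
        # ORACLE: maximize the p-weighted slack over the trace-R PSD set.
        C = np.einsum("i,ijk->jk", p, J) - (p @ h / R) * np.eye(n)
        eigvals, eigvecs = np.linalg.eigh(C)
        if R * eigvals[-1] < 0:
            return None                       # certificate of infeasibility
        v = eigvecs[:, -1]
        Y = R * np.outer(v, v)
        slack = (np.einsum("ijk,jk->i", J, Y) - h) / rho   # normalized satisfaction
        w = w * (1.0 - eps * slack)           # well-satisfied constraints lose weight
        Y_sum += Y
    return Y_sum / rounds

J = np.array([np.diag([1.0, 0.0]), np.diag([0.0, 1.0])])
Y_bar = mwu_feasibility(J, np.array([0.4, 0.4]), R=1.0, rho=1.0)
slack_bar = np.einsum("ijk,jk->i", J, Y_bar) - np.array([0.4, 0.4])
infeasible = mwu_feasibility(J, np.array([0.6, 0.6]), R=1.0, rho=1.0)
```

No single ORACLE answer satisfies both toy constraints, since each answer is rank one, but the average over rounds does so up to a small additive error. The second call is rejected in the first round because the two diagonal entries cannot both reach 0.6 under a unit trace.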


Assuming the ORACLE obtains a feasible solution $Y^{(t)}$, we denote the normalized satisfaction of the i-th
constraint of problem (5) as $m_i^{(t)} = \frac{1}{\rho} [J_i \bullet Y^{(t)} - h_i]$. Here $\rho$ is called the width parameter, satisfying
$\forall i, |J_i \bullet Y^{(t)} - h_i| \le \rho$. We update the weights as $w_i^{(t+1)} = w_i^{(t)} (1 - \epsilon m_i^{(t)})$, where $\epsilon$ is a
parameter smaller than 1/2. Hence, $w_i^{(t+1)}$ is smaller than $w_i^{(t)}$ if $m_i^{(t)} > 0$, and its value increases
otherwise.

The algorithm is depicted in Alg. 2. After $T$ rounds, the averaged solution $\bar{Y} = (\sum_{t=1}^{T} Y^{(t)}) / T$
is returned. Following Q, we have Theorem 1, which guarantees that either $\bar{Y}$ achieves a predefined
accuracy or we can claim that the original problem (5) is infeasible.

Theorem 1. Let $\delta > 0$ be a given additive error¹. Alg. 2 either solves problem (5) up to $\delta$, or
correctly concludes that it is infeasible, making $O(\rho^2 \ln(m) / \delta^2)$ calls to the ORACLE.


5 Pseudo-Metric & Dimension Reduction 


In this section, we deal with the case of a pseudo-metric, i.e., $M$ is PSD. In the literature of
Mahalanobis metric learning, a PSD $M$ is beneficial due to the existence of the decomposition $M = C^\top C$,
where $C \in \mathbb{R}^{q \times d}$. The distance function can then be rewritten as $d(x, y) = \|Cx - Cy\|^2$. It thus
removes the PSD constraint and allows flexible dimension reduction by choosing $q < d$.

In our setting, a PSD $M$ could be problematic since it may have an unbounded condition number. According
to the spectral theorem, the decomposition $M = Q^\top \Lambda Q$ is applicable, where $\Lambda \in \mathbb{R}^{q \times q}$ is a diagonal
matrix with the nonzero eigenvalues of $M$, $Q \in \mathbb{R}^{q \times d}$ has orthonormal rows and $q$ is the rank of $M$. We can
also choose different $q$ to form different low-rank approximations of $M$. Hence $Q$ and $\Lambda$ act as a
dimension-reduction mapping and a dimension-wise scaling operation respectively.
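
The roles of $Q$ and $\Lambda$ can be made concrete with a short sketch. It is an illustration only; the helper name `low_rank_factors` and the random test matrix are assumptions made for this example.

```python
import numpy as np

def low_rank_factors(M, q):
    """Return (Q, Lam): Q has shape (q, d) with orthonormal rows and Lam is the
    diagonal matrix of the q largest eigenvalues of the symmetric matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)           # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:q]
    Q = eigvecs[:, top].T
    Lam = np.diag(eigvals[top])
    return Q, Lam

rng = np.random.default_rng(0)
C = rng.standard_normal((2, 5))
M = C.T @ C                                        # rank-2 PSD matrix in R^{5x5}
Q, Lam = low_rank_factors(M, q=2)

x, y = rng.standard_normal(5), rng.standard_normal(5)
dist_full = (x - y) @ M @ (x - y)
# Project to q dimensions with Q, then rescale each dimension with sqrt(Lam).
dist_low = np.sum((np.sqrt(Lam) @ Q @ (x - y)) ** 2)
kappa_lam = np.linalg.cond(Lam)
```

Although $M$ is singular here, the condition number of $\Lambda$ is finite, so a distortion bound can be imposed on $\Lambda$ instead of $M$.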

By replacing $M$ with the decomposition and adding orthogonality constraints for $Q$ in the original BDML,
we can conduct pseudo-metric learning by alternately optimizing $Q$ and $\Lambda$. Specifically, when
$Q$ is fixed, optimizing the diagonal $\Lambda$ is just a simple case of the original BDML. Nevertheless,
optimizing $Q$ with a known $\Lambda$ is not straightforward. In particular, in the case of p-BDML, when $\Lambda$ is fixed
such that $\Lambda \in \mathcal{P}^q_R$ and $\kappa(\Lambda) \le K$, problem (2) can be reformulated as below.

$$\min_{Q \in \mathbb{R}^{q \times d}} \ \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} X_{ij} \bullet (Q^\top \Lambda Q) \quad \text{s.t.} \quad X_{ij} \bullet (Q^\top \Lambda Q) \ge \mu, \ \forall (i,j) \in \mathcal{I}, \quad Q Q^\top = I. \qquad (7)$$

Note that this learning problem is nontrivial because optimizing $Q$ is a quadratically constrained
quadratic program (QCQP). To overcome the difficulty, we have the following proposition.


1 Additive error up to $\delta$ means that any constraint is violated by at most $\delta$, i.e., $\forall i, J_i \bullet \bar{Y} - h_i \ge -\delta$.
Algorithm 3: Gaussian Randomization Procedure

1: Initialization: Given the optimal solution $\hat{Q}^*$, iteration number $T'$, ratio $\gamma$ and tolerance $\epsilon$.
2: For t = 1, 2, ..., T':
3:   Sample $\xi_t \sim \mathcal{N}(0, \hat{Q}^*)$.
4: End
5: $\xi = \arg\min_{\xi_t} \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} \xi_t^\top \hat{X}_{ij} \xi_t$.
6: Reshape $\xi$ from $\mathbb{R}^{qd \times 1}$ to $\mathbb{R}^{q \times d}$.
7: Return $\xi$ as the approximate solution of problem (7).


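Proposition 2. Problem (7) can be relaxed to an SDP as follows,

$$\min_{\hat{Q} \in \mathbb{S}^{qd}_{+}} \ \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} \hat{X}_{ij} \bullet \hat{Q} \quad \text{s.t.} \quad \hat{X}_{ij} \bullet \hat{Q} \ge \mu, \ \forall (i,j) \in \mathcal{I}, \quad \hat{A}_{uv} \bullet \hat{Q} = b_{uv}, \ \forall (u,v) \in \mathcal{C}, \qquad (8)$$

where $\hat{X}_{ij} = X_{ij} \otimes \Lambda$ and $\otimes$ stands for the Kronecker product. $\hat{A}_{uv}$ is a block diagonal matrix which
contains $d$ identical blocks $B_{uv} \in \mathbb{R}^{q \times q}$. The $(u, v)$-th and $(v, u)$-th entries of $B_{uv}$ are 1 while the others are
0. $b_{uv} = 2$ if $u = v$, otherwise $b_{uv} = 0$. $\mathcal{C} = \{(u, v) \in [q] \times [q] \mid u \le v\}$ and $[q] = \{1, \ldots, q\}$.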
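
The lifting behind this relaxation can be verified numerically for a feasible $Q$. The sketch below is an illustration under assumptions: it uses column-stacking vectorization, zero-based indices, and reads $B_{uv}$ as $E_{uv} + E_{vu}$ (so that the diagonal case carries a 2, matching $b_{uu} = 2$); the helper name `A_hat` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 2, 4
Q = np.linalg.qr(rng.standard_normal((d, q)))[0].T     # q x d with orthonormal rows
Lam = np.diag([2.0, 0.5])
diff = rng.standard_normal(d)
X = np.outer(diff, diff)                               # X_ij = (x_i - x_j)(x_i - x_j)^T

vec_q = Q.reshape(-1, order="F")                       # column-stacking vec(Q) in R^{qd}
Q_hat = np.outer(vec_q, vec_q)                         # rank-one lifted variable
X_hat = np.kron(X, Lam)

lhs = np.sum(X * (Q.T @ Lam @ Q))                      # X . (Q^T Lam Q)
rhs = np.sum(X_hat * Q_hat)                            # X_hat . Q_hat

def A_hat(u, v):
    """Block-diagonal matrix with d copies of B_uv = E_uv + E_vu (assumed reading)."""
    B = np.zeros((q, q))
    B[u, v] += 1.0
    B[v, u] += 1.0
    return np.kron(np.eye(d), B)

# A_hat(u, v) . Q_hat equals 2 (Q Q^T)_{uv}, which encodes the constraint Q Q^T = I.
orth = {(u, v): float(np.sum(A_hat(u, v) * Q_hat))
        for u in range(q) for v in range(u, q)}
```

Dropping the rank-one requirement on the lifted variable is what turns the QCQP into an SDP.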


This proposition shows that when $\Lambda$ is fixed we can learn $Q$ by solving the above SDP relaxation.
Moreover, since the equality constraints in (8) imply $\mathrm{Tr}(\hat{Q}) = q$, we can exploit the MWU
method again.

Denoting the optimal objective values of problems (7) and (8) as $g_{qp}$ and $g_{sdp}$ respectively, it is
obvious that $g_{sdp} \le g_{qp}$. Hence we aim at upper-bounding $g_{qp}$. Specifically, once we have obtained the
optimal solution $\hat{Q}^*$ of problem (8), we construct an approximate solution $\xi$ of problem (7) based on the
Gaussian randomization procedure shown in Alg. 3. We prove the following theorem to ensure that,
in the worst case, Alg. 3 generates with high probability an approximate solution with approximation ratio
$\omega$.

Theorem 2. If the optimal solution of problem (8) is $\hat{Q}^*$ and $\xi \in \mathbb{R}^{qd}$ is a random vector generated
from the real-valued normal distribution $\mathcal{N}(0, \hat{Q}^*)$, then for any $\gamma > 0$, $\epsilon > 0$ and $\omega > 1$, we have,


Prob (a > ■yp & C<e & C < ojG • Q*) > 1 - \1\ 


- r exp ( - - (w - V 2 u - l) ) - 


rq{q + 1 ) 
2 


exp 


(r ~ If 
4 


2 (r — 1)7 

7T — 2 
e 2 

8 rdq 2 


77: 


+ eX P ( o_/^2 


_ j _ 

where r = rank(Q*) and T = \/|(^) + 1. v, ( and G are defined respectively as v = 

min (i,j)ex£, T Xij£, C = max^„ )6C - b uv \ and G = ± E (i,j)es Xij- 

Remark. This theorem indicates that with well-chosen $\gamma$ and $\epsilon$, Alg. 3 can generate an approximate
solution for (7) with a guaranteed approximation ratio even in the worst case. For example, we can
consider a realistic case with $q = 10$, $d = 100$ and the number of constraints $|\mathcal{I}| = 100$. By choosing
$\gamma = \pi / (16 |\mathcal{I}|^2)$, $\epsilon = 40 q \sqrt{rd}$ and with appropriate rank reduction on $\hat{Q}^*$ as in ED, it can be shown that
after running Alg. 3 for 100 iterations, we have with very high probability² that $\frac{1}{\omega} g_{qp} \le g_{sdp} \le g_{qp}$,
where $\omega = 10$. However, the price we pay is that the orthogonality constraints are only loosely satisfied. In
practice, we found that the resulting approximate solution works well, which indicates that removing the
orthogonality constraints and transforming $\Lambda$ to a full-rank matrix may also be an alternative modeling
choice.


As for the pseudo-metric learning of t-BDML, we can still use the above algorithm to obtain an
approximate solution. However, Theorem 2 does not hold in this case, since not all constraint matrices of t-BDML
are PSD.

2 The probability is at least 0.999828. 
6 Generalization Bound for BDML


To theoretically investigate whether the distortion has an impact on the generalization ability, we
derive the generalization bound of our BDML following the stability analysis of learning
algorithms 0123 .

Before diving into the details, we first introduce one assumption: we only consider the case
where the metric matrix is full rank. Then we clarify some preliminary notation. Each training
sample $z$ inside the training set $D$ is drawn i.i.d. from some unknown distribution $\mathcal{D}$, and the range
of $z$ is denoted as $\mathcal{Z}$. $D^i$ is a perturbed set of $D$ obtained by replacing the i-th sample with a new
sample drawn from $\mathcal{D}$, i.e., $D^i = \{D \setminus z_i\} \cup \{z_i'\}$, where $z_i' \sim \mathcal{D}$. We make a mild assumption that all
data points are contained in a T-ball, i.e., $\|x\|_2 \le T$. We denote the learning algorithm as $A$, the true
risk or generalization error as $\mathcal{R}(A, D)$ and the empirical risk as $\mathcal{R}_{emp}(A, D) = \frac{1}{n} \sum_{i=1}^{n} \ell(A_D, z_i)$,
where in our case the loss function is $\ell = M \bullet X_{ij}$ and $n = |\mathcal{S}|$. Based on [5], we define the
Uniform-Replace-One stability as,

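Definition 3. An algorithm $A$ has Uniform-Replace-One stability $\beta$ with respect to the loss function
$\ell$ if $\forall D \in \mathcal{Z}^n$, $\forall i \in \{1, \ldots, n\}$, $\sup_{z \in \mathcal{Z}} |\ell(A_D, z) - \ell(A_{D^i}, z)| \le \beta$.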

Here Ad means the learning algorithm A is trained on the dataset D. Note that our definition is 
stronger than the one proposed in l36l . thus being more restrictive. We have the following lemma, 
which specifies the uniform-RO stability of our BDML. 

Lemma 2. The Uniform-Replace-One stability of our BDML algorithm with respect to the given 
loss function £ is /3 = X K +UR r _ 

This lemma indicates that the Uniform-Replace-One stability of $A$ is positively correlated with
the bound of distortion $K$. It means that a low distortion of the metric embedding leads to a stable
algorithm. Although $\beta$ does not decrease with the number of samples $n$, which may not
be seen as stable in some sense 0, it is clear that the stability can be controlled by the distortion.
With this stability result, we further prove the following generalization bound, which theoretically
explains the relationship between distortion and generalization error.

Theorem 3. For any metric learning algorithm $A$ with Uniform-Replace-One stability $\beta$ with
respect to the given loss function $\ell$, we have with probability at least $1 - \delta$,

K(A, D) < U emp (A, D) + 2Ty^ (^“T“ + 3/3 )' 

Specifically, for our BDML algorithm, 


2 fiT 2 / 2 K / K \ 

1Z{A, D) < lZ emp (A, D) H--—w -j- f — + 6 K + 6 ). 

Remark. There are several interesting things to note regarding this theorem.

First, it supports our intuitive conjecture that a large distortion incurs overfitting during metric
learning. It encourages us to choose a small value of $K$ to improve the generalization ability of $A$.
On the other hand, setting $K$ as small as possible is unwise, since it would constrain our hypothesis
class too much and thus may increase both the true risk and the empirical risk. Therefore, it suggests
choosing moderately small values of $K$ in practice via cross validation.

Second, the generalization error tends to decrease as the dimension $d$ increases. This
phenomenon seems a bit counter-intuitive, since it implies that our method becomes more stable in
a higher-dimensional feature space. However, this happens only if the previous assumption holds,
i.e., $M$ is full rank. In this case, increasing the dimension squashes the range of the spectrum of
$M$, since the trace bound $R$ and the distortion bound $K$ are fixed. If $M$ is instead rank-deficient, the
above analysis does not hold. In particular, the bound will depend on the rank of $M$
and its perturbation. This means that naively increasing the feature dimension by padding zeros will
not make the bound tighter. We will illustrate this issue in the appendix.
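
The squashing effect can be made concrete with an elementary bound, offered here as an illustrative derivation and not as the lemma proved in the appendix. If $\mathrm{Tr}(M) \le R$ and $\kappa(M) \le K$, then $R \ge \lambda_{\max} + (d - 1)\lambda_{\min} \ge \lambda_{\max}(1 + (d - 1)/K)$, hence $\lambda_{\max} \le KR / (K + d - 1)$. The sketch below checks this on random diagonal metrics; both function names are hypothetical.

```python
import numpy as np

def max_eigenvalue_bound(K, R, d):
    """Upper bound on the largest eigenvalue when Tr(M) <= R and kappa(M) <= K."""
    return K * R / (K + d - 1)

def random_feasible_metric(K, R, d, rng):
    """Sample a diagonal M with eigenvalue ratio at most K, rescaled so Tr(M) = R."""
    eigvals = rng.uniform(1.0, K, size=d)
    return np.diag(eigvals * R / eigvals.sum())

rng = np.random.default_rng(0)
K, R = 10.0, 1.0
dims = (2, 10, 100)
bounds = {d: max_eigenvalue_bound(K, R, d) for d in dims}
# Largest eigenvalue observed over 200 random feasible metrics per dimension.
worst = {d: max(random_feasible_metric(K, R, d, rng).max() for _ in range(200))
         for d in dims}
```

With $K$ and $R$ fixed, the bound shrinks roughly like $1/d$, which is consistent with the remark that a full-rank $M$ has its spectrum compressed as the dimension grows.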
Dataset       Wine         Iris         Diabetes      Waveform      Segment
Euc           3.46(3.60)   5.11(2.58)   31.09(2.03)   18.87(0.65)   5.61(0.92)
Xing          4.04(4.00)   6.67(3.11)   32.09(3.56)   16.43(1.00)   5.26(0.60)
LMNN          3.08(2.07)   4.22(1.95)   29.70(3.20)   18.61(0.72)   3.69(0.70)
ITML          1.15(2.07)   4.44(2.57)   29.96(2.97)   15.94(0.83)   5.02(0.70)
BoostMetric   2.31(2.18)   3.56(2.52)   26.78(2.12)   16.86(0.90)   4.21(0.48)
p-BDML        2.83(1.3)    3.11(2.61)   27.57(2.21)   15.78(0.60)   4.21(0.79)
t-BDML        2.26(2.30)   2.44(1.64)   26.43(2.30)   15.34(0.72)   3.62(0.34)

Table 1: Comparison of average test errors (%) and standard deviations on UCI datasets. 


7 Experiments 

We present empirical evaluations of our BDML algorithm on a wide range of tasks, including
classification on several UCI datasets 12, domain adaptation on medium-scale datasets [35], and face
verification on the large-scale LFW dataset ED. Before presenting the results, we first discuss a
practical strategy to speed up the bisection method, since it is sometimes hard to estimate a tight
interval of the optimal objective value in advance. Specifically, we select several fixed upper bounds
and then solve the convex feasibility problem (4) for each of them in parallel. If a trial is successful, we use it to
shrink the upper bound; otherwise we use it to raise the lower bound. This procedure provides a largely
reduced interval at a time cost as small as one call of the MWU solver. Note that we set parameters via
cross validation. The impact of different parameters and the runtime are provided in the appendix due
to space limits.

7.1 Classification 

We first conduct classification experiments on several UCI datasets, including Wine, Iris, Diabetes,
Segment and Waveform, to validate the effectiveness of our BDML. We randomly split each dataset
into 70% for training and 30% for testing and report the average test errors and standard deviations
over 10 repetitions of the random split. We compare with the baseline of the Euclidean metric
and several strong competitors, namely Xing Edl, LMNN [39], ITML Q and BoostMetric [38]. The
neighborhood size of the kNN classifier is 3 and all metrics $M$ are initialized as the identity matrix.
We carefully set the other parameters of these methods via cross validation. The results are listed in
Table 1, in which the best ones are bolded. Both our p-BDML and t-BDML perform well on these
datasets. In particular, t-BDML is consistently better than p-BDML, which validates the effectiveness
of triplet constraints as suggested by [39].

We then demonstrate how the performance of the kNN classifier varies according to the condition
number of the learned metric in Fig. 2. The x-axis is the natural logarithm of the condition number.
It is clear from the figure that the average test errors of both p-BDML and t-BDML first decrease
and then increase as the condition number becomes larger. These results provide strong
evidence to support our previous analysis that a largely distorted metric space leads to overfitting and a
small distortion may result in underfitting. Our BDML effectively controls the distortion, thus
improving the generalization ability.

7.2 Domain Adaptation 

We also apply our BDML to domain adaptation problems, under both unsupervised and
semi-supervised settings. In the former case, we use labeled samples in a source domain for training and
test on the unlabeled samples in the target domain. In the latter setting, apart from labeled
samples in a source domain, a small number of labeled samples in the target domain are also
accessible during training. We use the same dataset as in [35], which contains 2,533 images of 10 categories
from 4 domains: Caltech, Amazon, Webcam, and Dslr. We use the same 10 categories as [10] across
all four domains. Experiments are repeated with the 20 fixed train/test splits offered by [35]. We set
the number of neighbors of kNN to 1, as in other methods. Since the original SURF feature is
800-dimensional, we perform pseudo-metric learning with t-BDML and initialize the dimension-reduction
mapping via PCA. The size of the mapping matrix is set to 30 x 800 according to cross-validation.
Figure 2: Average test error varies with the natural logarithm of condition number on (a) Segment
and (b) Iris datasets.


Methods          A→C        A→W        C→A        C→D        D→A        D→W        W→A        W→D
OrigFeat         22.6(0.3)  23.5(0.6)  20.8(0.4)  22.0(0.6)  27.7(0.4)  53.1(0.6)  20.7(0.6)  37.3(1.2)
SGF              35.3(0.5)  31.0(0.7)  36.8(0.5)  32.6(0.8)  32.0(0.4)  66.0(0.5)  27.5(0.5)  54.3(1.2)
GFK              35.6(0.4)  34.4(0.9)  36.9(0.4)  35.2(1.0)  32.5(0.5)  74.9(0.6)  31.1(0.8)  70.6(0.9)
LMNN             35.7(0.5)  32.9(0.8)  33.8(0.7)  31.5(1.6)  33.7(0.4)  75.1(0.8)  30.8(0.7)  67.6(1.0)
DML-eig          35.0(0.7)  28.9(0.7)  33.7(0.7)  32.7(1.3)  33.4(0.3)  78.2(0.8)  32.5(0.9)  72.4(0.6)
t-BDML           37.2(0.3)  35.2(1.0)  35.2(0.7)  33.4(1.3)  37.1(0.6)  78.6(0.7)  33.2(0.8)  73.8(0.6)
OrigFeat(semi)   24.0(0.3)  31.6(0.6)  23.1(0.4)  26.5(0.7)  31.3(0.7)  55.5(0.7)  30.8(0.6)  44.3(1.0)
ITML(semi)       27.3(0.7)  36.0(1.0)  33.7(0.8)  35.0(1.1)  30.3(0.8)  55.6(0.7)  32.3(0.8)  51.3(0.9)
SGF(semi)        37.7(0.5)  37.9(0.7)  40.2(0.7)  36.6(0.8)  39.2(0.7)  69.5(0.9)  38.2(0.6)  60.6(1.0)
GFK(semi)        37.8(0.4)  53.7(0.8)  42.0(0.5)  49.5(0.9)  45.0(0.7)  78.7(0.5)  42.8(0.7)  75.0(0.7)
LMNN(semi)       36.6(0.6)  49.6(0.9)  43.3(0.5)  50.3(1.3)  42.0(0.7)  78.6(0.7)  42.3(0.6)  72.8(1.1)
DML-eig(semi)    27.8(0.7)  40.5(1.0)  43.3(0.6)  45.1(1.6)  43.4(0.6)  80.5(0.9)  40.8(0.7)  76.8(0.9)
t-BDML(semi)     38.8(0.3)  55.8(1.1)  44.6(0.6)  54.0(1.1)  43.9(0.6)  83.8(0.5)  44.8(0.6)  79.2(0.7)

Table 2: Comparison of unsupervised (upper part) and semi-supervised (lower part denoted with 
“semi”) domain adaptation. Mean test accuracy (%) and standard error (inside parentheses) are 
reported. 


Table 2 presents the mean test accuracy and standard errors of various metric learning based 
methods, including LMNN [39], ITML [35], SGF [11], GFK [10] and DML-eig [42], where A→C means 
the adaptation from source domain A (i.e., Amazon) to the target domain C (i.e., Caltech). For a 
fair comparison, we adopt the best results of GFK under the PCA subspace setting reported by [10]. 
The results in the original Euclidean space are denoted as OrigFeat. In most subtasks of these two 
settings, our f-BDML outperforms the other competitors, which demonstrates the strength of the 
proposed pseudo-metric learning scheme. 


7.3 Face Verification 

Finally, we apply our BDML to an unconstrained face verification task, using the large-scale LFW 
dataset that contains 13,233 face images of 5,749 people. It is challenging due to the large variations 
of faces in illumination, expression, pose, resolution, etc. There are 6 standard protocols [15] for 
evaluating results. We use the setting called “Image-Restricted, Label-Free Outside Data”, where we 
can only access the provided labeled pairs of faces during training. Thus we only evaluate 
pseudo-metric learning with p-BDML, since using triplet constraints would violate this setting. The 
dataset is organized in 10 folds, each of which contains 300 similar pairs of faces and 300 dissimilar 
ones. The reported accuracy is obtained via cross-validation on the provided 10 folds. 
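
As a minimal sketch of the reporting protocol described above, the following Python snippet computes a mean accuracy and its standard error over 10 folds. The per-fold accuracies are hypothetical placeholder values (not results from this paper), the function name is ours, and we assume the standard error is the sample standard deviation divided by the square root of the number of folds.

```python
import math
import statistics


def mean_and_standard_error(fold_accuracies):
    """Return (mean, standard error of the mean) over cross-validation folds."""
    n = len(fold_accuracies)
    mean = statistics.fmean(fold_accuracies)
    # Sample standard deviation (n - 1 in the denominator), divided by sqrt(n).
    std_err = statistics.stdev(fold_accuracies) / math.sqrt(n)
    return mean, std_err


# Hypothetical per-fold accuracies, for illustration only.
folds = [0.86, 0.87, 0.85, 0.86, 0.88, 0.86, 0.85, 0.87, 0.86, 0.87]
mean_acc, std_err = mean_and_standard_error(folds)
print(f"{mean_acc:.4f} +/- {std_err:.4f}")
```

Reporting the standard error rather than the raw standard deviation makes the numbers comparable with the "Std Err." column used in the result tables.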

Current state-of-the-art methods under this setting often build various classifiers and combine 
multiple types of visual descriptors. Since we primarily aim at validating the effectiveness of BDML, 
we do not carry out intensive feature engineering or build complex similarity measurements. Instead, 



| Methods | Mean Cond. Num. | Mean Acc. | Std Err. |
|---|---|---|---|
| Xing | — | 0.7593 | 0.0059 |
| ITML | 1.2037 | 0.7812 | 0.0045 |
| LDML | — | 0.7927 | 0.0060 |
| DML-eig | 3228.4 | 0.8127 | 0.0230 |
| Sub-ITML | 1.2157 | 0.8145 | 0.0046 |
| KISSME | — | 0.8308 | 0.0056 |
| Sub-ML | 1171.397 | 0.8330 | 0.0026 |
| p-BDML | 8.3771 | 0.8632 | 0.0022 |


Table 3: Results of face verification on LFW, where “—” means not applicable due to no public 
implementation or an unbounded condition number. 


we use the public “funneled” SIFT feature³ and regard the learned distance metric as the similarity 
measure. Since the dimension of the raw SIFT feature is too large (≈ 4k), we reduce it to 800 via PCA 
before pseudo-metric learning with p-BDML and set the size of the dimension-reduction mapping to 
300 × 800. 

In Table 3, we present results of p-BDML and other metric learning based methods including 
Xing [41], ITML Q, LDML [12], KISSME [25], DML-eig [42], Sub-ITML and Sub-ML [6]. We 
also report the average condition number. The ITML methods optimize a logDet regularizer, which 
yields condition numbers close to one. This suggests that the metric space is not sufficiently 
distorted for good generalization performance. In contrast, the F-norm based regularization methods 
(e.g., Sub-ML) yield too large a distortion, which also hurts generalization. These results again 
support our theoretical analysis in Sec. 6. Our BDML method obtains an appropriate condition number, 
which results in decent generalization performance. A full comparison (including ROC curves) with 
other non-distance based methods is presented in the appendix. 

8 Conclusions 

In this paper, we propose bounded-distortion metric learning (BDML), which balances 
the fit to data against the distortion of the metric embedding. For a Mahalanobis metric space, BDML 
leads to a bounded condition number metric learning method, which possesses intriguing properties. 
We propose an efficient learning algorithm and further provide theoretical analysis, which explains 
why the distortion is a key ingredient to ensure good generalization ability. We also generalize BDML 
to pseudo-metric learning and propose an approximate solver based on the semidefinite relaxation and 
a randomized algorithm. Empirical results validate that our BDML leads to both better generalization 
and well-conditioned metrics. In the future, we would like to extend the distortion to 
non-Mahalanobis metrics and design corresponding approximation algorithms. 


3 http://lear.inrialpes.fr/people/guillaumin/data.php 

9 Appendix 

9.1 Proofs in Section 3 

Proof of Proposition 1 

Proof. Note that for any two different points $x, y$, we have 

$$\frac{d_M(f(x), f(y))}{d_I(x, y)} = \frac{(x - y)^\top M (x - y)}{(x - y)^\top (x - y)}.$$

It is easy to see that the right-hand side is the Rayleigh quotient of the PSD matrix $M$. Therefore, 

$$\lambda_{\min} \le \frac{d_M(f(x), f(y))}{d_I(x, y)} \le \lambda_{\max},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $M$ respectively. According to 
Definition 2, setting $r = \lambda_{\min}$, we find that for any $c \ge \lambda_{\max} / \lambda_{\min}$ it always holds that 

$$r \cdot d_X(x, y) \le d_Y(f(x), f(y)) \le c\, r \cdot d_X(x, y).$$

Hence the distortion of the metric embedding is $\inf\{c \mid c \ge \lambda_{\max} / \lambda_{\min}\} = \lambda_{\max} / \lambda_{\min}$. 

Q.E.D. 
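
The Rayleigh quotient argument can be checked numerically. The sketch below is only an illustration under our own assumptions: the 2 × 2 matrix entries are hypothetical, the helper names are ours, and the closed-form eigenvalue formula applies to symmetric 2 × 2 matrices only. It samples random difference vectors and confirms that the ratio of squared distances stays between the extreme eigenvalues, so the spread of the ratios never exceeds the condition number.

```python
import math
import random


def eigenvalues_2x2(a, b, c):
    """Eigenvalues (min, max) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def rayleigh_quotient(a, b, c, v):
    """(v^T M v) / (v^T v) for M = [[a, b], [b, c]]."""
    x, y = v
    return (a * x * x + 2.0 * b * x * y + c * y * y) / (x * x + y * y)


# A hypothetical positive definite Mahalanobis matrix M.
a, b, c = 4.0, 1.0, 2.0
lam_min, lam_max = eigenvalues_2x2(a, b, c)
distortion = lam_max / lam_min  # condition number of M

# Random difference vectors x - y.
rng = random.Random(0)
ratios = [rayleigh_quotient(a, b, c, (rng.gauss(0, 1), rng.gauss(0, 1)))
          for _ in range(10000)]
print(min(ratios) >= lam_min, max(ratios) <= lam_max)
```

With enough samples, `max(ratios) / min(ratios)` approaches `distortion` from below, which is the sense in which the condition number is the smallest admissible $c$.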


9.2 Proofs in Section 4 

Proof of Lemma 1 

Proof. If $Y^*$ is a feasible solution of problem (5), then $\forall i$, $J_i \bullet Y^* \ge h_i$. Since $\forall i$, $0 \le p_i \le 1$, we 
have $\sum_{i=1}^{m} p_i (J_i \bullet Y^* - h_i) \ge 0$. Hence, $Y^*$ is also a feasible solution of problem (6). 

If there exists a probability vector $p$ such that problem (6) is infeasible, then for all $Y \in \mathcal{P}^{d+1}$, 
we have $\sum_{i=1}^{m} p_i (J_i \bullet Y - h_i) < 0$. Therefore, the original problem (5) is also infeasible, since 
otherwise there would exist a solution $Y$ such that $Y \in \mathcal{P}^{d+1}$ and $\forall i$, $J_i \bullet Y - h_i \ge 0$. 

Q.E.D. 


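Proof of Theorem 1 

Proof. In the $t$-th round, we run the ORACLE with a probability distribution $p^{(t)}$ as input. 

If the ORACLE declares that problem (6) is infeasible, then by Lemma 1 the original problem is 
correctly concluded to be infeasible. 

On the other hand, if this situation never happens during the iterations, i.e., in every round $t$ the 
ORACLE succeeds in finding a solution $Y^{(t)}$ to problem (6), then we get the following inequality, 
$\forall t = 1, 2, \dots, T$, 

$$\sum_{i=1}^{m} p_i^{(t)} \, J_i \bullet Y^{(t)} \ge \sum_{i=1}^{m} p_i^{(t)} h_i.$$

Then, equipped with the above inequality and Theorem 2 in [23], we have, for any constraint $i$, 

$$\begin{aligned}
0 &\le \sum_{t=1}^{T} \frac{1}{\rho} \left( J_i \bullet Y^{(t)} - h_i \right) + \epsilon \sum_{t=1}^{T} \frac{1}{\rho} \left| J_i \bullet Y^{(t)} - h_i \right| + \frac{\ln(m)}{\epsilon} \\
&= (1 + \epsilon) \sum_{t=1}^{T} \frac{1}{\rho} \left( J_i \bullet Y^{(t)} - h_i \right) + 2\epsilon \sum_{t \in \mathcal{T}_-} \frac{1}{\rho} \left| J_i \bullet Y^{(t)} - h_i \right| + \frac{\ln(m)}{\epsilon} \\
&\le (1 + \epsilon) \frac{1}{\rho} \sum_{t=1}^{T} \left( J_i \bullet Y^{(t)} - h_i \right) + 2\epsilon T + \frac{\ln(m)}{\epsilon}.
\end{aligned}$$

Here $\mathcal{T}_-$ denotes the set of indices $t$ for which $J_i \bullet Y^{(t)} - h_i < 0$. Dividing both sides of the 
above inequality by $T$ and rearranging, we obtain 

$$J_i \bullet \bar{Y} - h_i \ge -\frac{\rho}{1 + \epsilon} \left( 2\epsilon + \frac{\ln(m)}{\epsilon T} \right).$$

Note that $\bar{Y} = \left( \sum_{t=1}^{T} Y^{(t)} \right) / T$ is the solution returned by the multiplicative weights update method. 
Now, if we set $\epsilon = C_1 \delta / \rho$ and $T = C_2 \rho^2 \ln(m) / \delta^2$, where $C_1, C_2$ are positive real numbers, then we have 

$$\frac{\rho}{1 + \epsilon} \left( 2\epsilon + \frac{\ln(m)}{\epsilon T} \right) = \frac{\rho}{\rho + C_1 \delta} \left( 2 C_1 + \frac{1}{C_1 C_2} \right) \delta.$$

With some calculation, we find that if we choose $0 < C_1 < 1/2$ and set $C_2 = -\frac{1}{C_1 (2 C_1 - 1)}$, 
e.g., $C_1 = 1/4$ and $C_2 = 8$, then $J_i \bullet \bar{Y} - h_i \ge -\delta / (1 + \epsilon) \ge -\delta$ for any constraint $i$. 

Q.E.D. 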
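
The parameter choice at the end of the proof can be sanity-checked with a few lines of arithmetic. The sketch below assumes the settings $\epsilon = C_1 \delta / \rho$ and $T = C_2 \rho^2 \ln(m) / \delta^2$ with $C_2 = 1 / (C_1 (1 - 2 C_1))$; the function names and the numeric values of $\delta$, $\rho$ and $m$ are hypothetical and chosen only for illustration.

```python
import math


def mwu_parameters(delta, rho, m, c1=0.25):
    """Step size eps and iteration count T for a target violation delta."""
    c2 = 1.0 / (c1 * (1.0 - 2.0 * c1))  # equals 8 when c1 = 1/4
    eps = c1 * delta / rho
    # Rounding T up can only shrink the ln(m) / (eps * T) term.
    T = math.ceil(c2 * rho ** 2 * math.log(m) / delta ** 2)
    return eps, T


def violation_bound(eps, T, rho, m):
    """Right-hand side rho / (1 + eps) * (2 * eps + ln(m) / (eps * T))."""
    return rho / (1.0 + eps) * (2.0 * eps + math.log(m) / (eps * T))


# Hypothetical values: tolerance 0.1, width 500, 1000 constraints.
delta, rho, m = 0.1, 500.0, 1000
eps, T = mwu_parameters(delta, rho, m)
bound = violation_bound(eps, T, rho, m)
print(eps, T, bound)
```

The required number of iterations grows with $\rho^2 / \delta^2$, which is why the width matters so much for the running time of the solver.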


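9.3 Proofs in Section 5 

Proof of Proposition 2 

Proof. We first restate problem (7) as below. 

$$\begin{aligned}
\min_{Q \in \mathbb{R}^{q \times d}} \quad & \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} X_{ij} \bullet (Q^\top A Q) \\
\text{s.t.} \quad & X_{ij} \bullet (Q^\top A Q) \ge \mu, \quad \forall (i,j) \in \mathcal{I}, \qquad (10a) \\
& Q Q^\top = I. \qquad (10b)
\end{aligned}$$

We denote the vectorization of $Q$ as $\zeta$ and rewrite the above problem in the following standard 
form of quadratically constrained quadratic programming (QCQP), 

$$\begin{aligned}
\min_{\zeta \in \mathbb{R}^{qd}} \quad & \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} \zeta^\top \tilde{X}_{ij} \zeta \\
\text{s.t.} \quad & \zeta^\top \tilde{X}_{ij} \zeta \ge \mu, \quad \forall (i,j) \in \mathcal{I}, \qquad (11a) \\
& \zeta^\top A_{uv} \zeta = b_{uv}, \quad \forall (u,v) \in \mathcal{C}, \qquad (11b)
\end{aligned}$$

where the set of margin constraints in (10a) corresponds to the one in (11a) and the set of orthogonal 
constraints in (10b) corresponds to the one in (11b). Specifically, $\tilde{X}_{ij} = X_{ij} \otimes A$, where $\otimes$ stands 
for the Kronecker product. We denote the index set of the upper triangular part of a $q$-dimensional 
square matrix as $\mathcal{C} = \{(u, v) \in [q] \times [q] \mid u \le v\}$, where $[q] = \{1, 2, \dots, q\}$. Then for each element 
$(u, v)$ of $\mathcal{C}$, $A_{uv}$ is a block diagonal matrix which contains $d$ identical blocks $B_{uv} \in \mathbb{R}^{q \times q}$, 
as follows: 

$$A_{uv} = \begin{pmatrix} B_{uv} & & \\ & \ddots & \\ & & B_{uv} \end{pmatrix},$$

where the $(u, v)$-th and $(v, u)$-th entries of $B_{uv}$ are 1 while all others are 0. Moreover, $b_{uv} = 2$ if $u = v$, 
and $b_{uv} = 0$ otherwise. 

Based on [29], to derive the SDP relaxation, we can easily observe the following, 

$$\zeta^\top \tilde{X}_{ij} \zeta = \mathrm{Tr}(\tilde{X}_{ij} \zeta \zeta^\top) = \tilde{X}_{ij} \bullet \zeta \zeta^\top.$$

Thus, setting $\mathcal{Q} = \zeta \zeta^\top$, we can write the equivalent form of the above QCQP as below. 

$$\begin{aligned}
\min_{\mathcal{Q}} \quad & \frac{1}{n} \sum_{(i,j) \in \mathcal{S}} \tilde{X}_{ij} \bullet \mathcal{Q} \\
\text{s.t.} \quad & \tilde{X}_{ij} \bullet \mathcal{Q} \ge \mu, \quad \forall (i,j) \in \mathcal{I}, \\
& A_{uv} \bullet \mathcal{Q} = b_{uv}, \quad \forall (u,v) \in \mathcal{C}, \\
& \mathcal{Q} \succeq 0, \quad \mathrm{rank}(\mathcal{Q}) = 1.
\end{aligned}$$

By removing the last rank constraint, we obtain the desired SDP relaxation. 

Q.E.D. 

Remark. Note that the orthogonal constraints in the original QCQP problem imply that $\mathrm{Tr}(\mathcal{Q}) = q$ 
in the SDP relaxation problem. 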
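
The vectorization step can be verified on a small instance. The sketch below is a pure-Python illustration under our own assumptions: `Q`, `X` and `A` are randomly generated or hypothetical stand-ins, the vectorization is taken to be column-major, and the helper names are ours. It checks that the trace form equals the quadratic form with the Kronecker product, and that an off-diagonal orthogonality block recovers twice the corresponding entry of $Q Q^\top$.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kron(A, B):
    """Kronecker product of A and B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def vec(Q):
    """Column-major vectorization: stack the columns of Q."""
    return [Q[i][j] for j in range(len(Q[0])) for i in range(len(Q))]


def quad_form(z, M):
    n = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


rng = random.Random(0)
q, d = 2, 3
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(q)]  # q x d mapping
x = [rng.gauss(0, 1) for _ in range(d)]                      # stands for x_i - x_j
X = [[x[a] * x[b] for b in range(d)] for a in range(d)]      # d x d matrix X_ij
A = [[2.0, 0.5], [0.5, 1.0]]                                 # hypothetical q x q matrix

zeta = vec(Q)
lhs = trace(matmul(X, matmul(transpose(Q), matmul(A, Q))))   # X . (Q^T A Q)
rhs = quad_form(zeta, kron(X, A))                            # zeta^T (X kron A) zeta

# Orthogonality block for u != v: ones at positions (u, v) and (v, u).
u, v = 0, 1
B = [[0.0, 1.0], [1.0, 0.0]]
I_d = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
A_uv = kron(I_d, B)                                          # d identical diagonal blocks
QQt = matmul(Q, transpose(Q))
print(abs(lhs - rhs), abs(quad_form(zeta, A_uv) - 2.0 * QQt[u][v]))
```

Both printed differences should be at floating-point noise level, which is all the identity needs for the margin and orthogonality constraints to carry over to the vectorized problem.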


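Before obtaining our Theorem 2, we need to introduce three lemmas. First, we state 
Lemma 1 in [30], which gives the polynomial tail bound of the left-side inequality 
$\xi^\top H \xi < \gamma \mathbb{E}[\xi^\top H \xi]$. Readers can refer to that paper for details of the proof. 

Lemma 3 (Left-side Polynomial Tail Bound). Let $H \in \mathbb{S}_+^{d}$, $Z \in \mathbb{S}_+^{d}$. Suppose $\xi \in \mathbb{R}^{d}$ is a random 
vector generated from the real-valued normal distribution $\mathcal{N}(0, Z)$. Then for any $\gamma > 0$, 

$$\mathrm{Prob}\left( \xi^\top H \xi < \gamma \mathbb{E}[\xi^\top H \xi] \right) \le \max\left\{ \sqrt{\gamma}, \frac{2(\bar{r} - 1)\gamma}{\pi - 2} \right\},$$

where $\bar{r} = \min\{\mathrm{rank}(H), \mathrm{rank}(Z)\}$. 

Then we derive a right-side exponential tail bound as below. 

Lemma 4 (Right-side Exponential Tail Bound). Let $H \in \mathbb{S}_+^{d}$, $Z \in \mathbb{S}_+^{d}$. Suppose $\xi \in \mathbb{R}^{d}$ is a 
random vector generated from the real-valued normal distribution $\mathcal{N}(0, Z)$. Then for any $\gamma > 1$, 

$$\mathrm{Prob}\left( \xi^\top H \xi > \gamma \mathbb{E}[\xi^\top H \xi] \right) \le \bar{r} \exp\left( -\frac{1}{2} \left( \gamma - \sqrt{2\gamma - 1} \right) \right),$$

where $\bar{r} = \min\{\mathrm{rank}(H), \mathrm{rank}(Z)\}$. 

Proof. Let $r = \mathrm{rank}(Z)$. Since $Z \in \mathbb{S}_+^{d}$, we can write $Z = U U^\top$ for some $U \in \mathbb{R}^{d \times r}$. We 
consider the eigen-decomposition of the real symmetric matrix $U^\top H U = L D L^\top = \sum_{i=1}^{r} \lambda_i l_i l_i^\top$, 
where $L = [l_1, l_2, \dots, l_r] \in \mathbb{R}^{r \times r}$ is an orthogonal matrix and $D = \mathrm{diag}\{\lambda_1, \lambda_2, \dots, \lambda_r\}$ with 
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \ge 0$. Note that $\lambda_i = 0$ for all $i > \bar{r}$ due to the fact that $U^\top H U$ has rank at 
most $\bar{r}$. Denoting $\bar{\xi} \sim \mathcal{N}(0, I_r)$, we can easily check that $U \bar{\xi} \sim \mathcal{N}(0, Z)$, i.e., $U \bar{\xi}$ is statistically 
identical to $\xi$. 

Therefore, we have that 

$$\begin{aligned}
\mathrm{Prob}\left( \xi^\top H \xi > \gamma \mathbb{E}[\xi^\top H \xi] \right) &= \mathrm{Prob}\left( \bar{\xi}^\top U^\top H U \bar{\xi} > \gamma \mathbb{E}[\bar{\xi}^\top U^\top H U \bar{\xi}] \right) \\
&= \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \lambda_i (l_i^\top \bar{\xi})^2 > \gamma \mathbb{E}\Big[ \sum_{i=1}^{\bar{r}} \lambda_i (l_i^\top \bar{\xi})^2 \Big] \right).
\end{aligned}$$

Denoting $u_i = l_i^\top \bar{\xi}$, we have that $\mathbb{E}[u_i] = 0$ and $\mathbb{E}[u_i^2] = 1$, i.e., $u_i$ is a standard normal variable. 
Hence, 

$$\begin{aligned}
\mathrm{Prob}\left( \xi^\top H \xi > \gamma \mathbb{E}[\xi^\top H \xi] \right) &= \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \lambda_i u_i^2 > \gamma \mathbb{E}\Big[ \sum_{i=1}^{\bar{r}} \lambda_i u_i^2 \Big] \right) \\
&= \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \lambda_i u_i^2 > \gamma \sum_{i=1}^{\bar{r}} \lambda_i \right) \\
&= \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \bar{\lambda}_i u_i^2 > \gamma \right),
\end{aligned}$$

where $\bar{\lambda}_i = \lambda_i / \sum_{j=1}^{\bar{r}} \lambda_j$ for $i = 1, \dots, \bar{r}$. Note that $\sum_{i=1}^{\bar{r}} \bar{\lambda}_i = 1$ and $\bar{\lambda}_1 \ge \bar{\lambda}_2 \ge \dots \ge \bar{\lambda}_{\bar{r}} \ge 0$. 

Then 

$$\begin{aligned}
\mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \bar{\lambda}_i u_i^2 > \gamma \right) &= 1 - \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} \bar{\lambda}_i u_i^2 \le \gamma \right) \\
&\le 1 - \mathrm{Prob}\left( u_1^2 \le \gamma \ \&\ \dots \ \&\ u_{\bar{r}}^2 \le \gamma \right) \\
&= \mathrm{Prob}\left( u_1^2 > \gamma \ \|\ \dots \ \|\ u_{\bar{r}}^2 > \gamma \right) \\
&\le \sum_{i=1}^{\bar{r}} \mathrm{Prob}\left( u_i^2 > \gamma \right) \\
&\le \bar{r} \exp\left( -\frac{1}{2} \left( \gamma - \sqrt{2\gamma - 1} \right) \right).
\end{aligned}$$

Note that the last step is due to the inequality in Lemma 7 and the fact that $u_i^2$ is a $\chi^2$ random 
variable with 1 degree of freedom. 

Q.E.D. 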
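
The last step of this proof bounds $\mathrm{Prob}(u_i^2 > \gamma)$ for a single standard normal $u_i$. As a rough numerical sanity check, the sketch below compares the exact tail, computed with the complementary error function, against the closed form $\exp(-\frac{1}{2}(\gamma - \sqrt{2\gamma - 1}))$ that results from setting $d = 1$ in the chi-square tail bound of Lemma 7. That closed form is our own reading of the bound, and the function names and test values of $\gamma$ are hypothetical.

```python
import math


def chi2_1_upper_tail(gamma):
    """Exact Prob(u^2 > gamma) for a standard normal u."""
    return math.erfc(math.sqrt(gamma / 2.0))


def single_term_bound(gamma):
    """Per-coordinate bound exp(-(gamma - sqrt(2 * gamma - 1)) / 2), gamma >= 1."""
    return math.exp(-0.5 * (gamma - math.sqrt(2.0 * gamma - 1.0)))


gammas = [1.0, 1.5, 2.0, 4.0, 9.0, 25.0]
table = [(g, chi2_1_upper_tail(g), single_term_bound(g)) for g in gammas]
for g, exact, bound in table:
    print(f"gamma={g:5.1f}  exact={exact:.3e}  bound={bound:.3e}")
```

The bound is loose near $\gamma = 1$, where it equals one, and only becomes informative for larger $\gamma$; multiplying it by $\bar{r}$ gives the union bound used in the lemma.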


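Finally, we derive a two-side exponential tail bound for our own purpose. 

Lemma 5 (Two-side Exponential Tail Bound). Let $\mathcal{Q}^* \in \mathbb{S}_+^{qd}$ be the optimal solution of problem 
(8) and $A_{uv} \in \mathbb{S}^{qd}$ be any constraint matrix corresponding to index set $\mathcal{C}$ in problem (8). Suppose 
$\xi \in \mathbb{R}^{qd}$ is a random vector generated from the real-valued normal distribution $\mathcal{N}(0, \mathcal{Q}^*)$. Then for 
any $\epsilon > 0$, 

$$\mathrm{Prob}\left( \left| \xi^\top A_{uv} \xi - \mathbb{E}[\xi^\top A_{uv} \xi] \right| > \epsilon \right) \le \bar{r} \left[ \exp\left( -\frac{(\tau - 1)^2}{4} \right) + \exp\left( -\frac{\epsilon^2}{8 \bar{r} d q^2} \right) \right],$$

where $\bar{r} = \min\{\mathrm{rank}(A_{uv}), \mathrm{rank}(\mathcal{Q}^*)\}$ and $\tau = \left( \frac{\sqrt{2}\, \epsilon}{q \sqrt{\bar{r} d}} + 1 \right)^{1/2}$. 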


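Proof. Let $r = \mathrm{rank}(\mathcal{Q}^*)$. Since $\mathcal{Q}^* \in \mathbb{S}_+^{qd}$, we can write $\mathcal{Q}^* = U U^\top$ for some $U \in \mathbb{R}^{qd \times r}$. 
Since $A_{uv}$ is symmetric, we can write the eigen-decomposition $U^\top A_{uv} U = L D L^\top$, 
where $L = [l_1, l_2, \dots, l_r] \in \mathbb{R}^{r \times r}$ is an orthogonal matrix and $D = \mathrm{diag}\{\lambda_1, \lambda_2, \dots, \lambda_r\}$ with 
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \ge 0$. Note that $\lambda_i = 0$ for all $i > \bar{r}$ due to the fact that $U^\top A_{uv} U$ has rank at 
most $\bar{r}$. Denoting $\bar{\xi} \sim \mathcal{N}(0, I_r)$, we can easily check that $U \bar{\xi} \sim \mathcal{N}(0, \mathcal{Q}^*)$, i.e., $U \bar{\xi}$ is statistically 
identical to $\xi$. 

Therefore, we have that 

$$\mathrm{Prob}\left( \left| \xi^\top A_{uv} \xi - \mathbb{E}[\xi^\top A_{uv} \xi] \right| > \epsilon \right) = \mathrm{Prob}\left( \Big| \sum_{i=1}^{\bar{r}} \lambda_i (l_i^\top \bar{\xi})^2 - \mathbb{E}\Big[ \sum_{i=1}^{\bar{r}} \lambda_i (l_i^\top \bar{\xi})^2 \Big] \Big| > \epsilon \right).$$

Denoting $u_i = l_i^\top \bar{\xi}$, we have that $\mathbb{E}[u_i] = 0$ and $\mathbb{E}[u_i^2] = 1$, i.e., $u_i$ is a standard normal variable. 
Hence, 

$$\begin{aligned}
\mathrm{Prob}\left( \left| \xi^\top A_{uv} \xi - \mathbb{E}[\xi^\top A_{uv} \xi] \right| > \epsilon \right) &= \mathrm{Prob}\left( \Big| \sum_{i=1}^{\bar{r}} \lambda_i (u_i^2 - 1) \Big| > \epsilon \right) \\
&\le \mathrm{Prob}\left( \sqrt{\sum_{i=1}^{\bar{r}} \lambda_i^2} \, \sqrt{\sum_{i=1}^{\bar{r}} (u_i^2 - 1)^2} > \epsilon \right) \\
&= \mathrm{Prob}\left( \|D\|_F \sqrt{\sum_{i=1}^{\bar{r}} (u_i^2 - 1)^2} > \epsilon \right).
\end{aligned}$$

Note that 

$$\|D\|_F = \|U^\top A_{uv} U\|_F \le \|A_{uv}\|_F \|U\|_F^2 = \|A_{uv}\|_F \, \mathrm{Tr}(\mathcal{Q}^*) \le q \sqrt{2d}.$$

Here the first equality uses the fact that the Frobenius norm is rotation-invariant. The first inequality 
uses the Cauchy-Schwarz inequality. For the last inequality, we can see that if $u = v$, then 
$\|A_{uu}\|_F = \sqrt{d}$, otherwise $\|A_{uv}\|_F = \sqrt{2d}$. Moreover, we have $\mathrm{Tr}(\mathcal{Q}^*) = q$ due to the fact that $\mathcal{Q}^*$ is 
the optimal solution of problem (8) satisfying the orthogonal constraints. Therefore, we have that 

$$\begin{aligned}
\mathrm{Prob}\left( \left| \xi^\top A_{uv} \xi - \mathbb{E}[\xi^\top A_{uv} \xi] \right| > \epsilon \right) &\le \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} (u_i^2 - 1)^2 > \frac{\epsilon^2}{2 d q^2} \right) \\
&= 1 - \mathrm{Prob}\left( \sum_{i=1}^{\bar{r}} (u_i^2 - 1)^2 \le \frac{\epsilon^2}{2 d q^2} \right) \\
&\le 1 - \mathrm{Prob}\left( (u_1^2 - 1)^2 \le \frac{\epsilon^2}{2 \bar{r} d q^2} \ \&\ \dots \ \&\ (u_{\bar{r}}^2 - 1)^2 \le \frac{\epsilon^2}{2 \bar{r} d q^2} \right) \\
&= \mathrm{Prob}\left( (u_1^2 - 1)^2 > \frac{\epsilon^2}{2 \bar{r} d q^2} \ \|\ \dots \ \|\ (u_{\bar{r}}^2 - 1)^2 > \frac{\epsilon^2}{2 \bar{r} d q^2} \right) \\
&\le \sum_{i=1}^{\bar{r}} \mathrm{Prob}\left( |u_i^2 - 1| > \frac{\epsilon}{q \sqrt{2 \bar{r} d}} \right) \\
&= \sum_{i=1}^{\bar{r}} \left[ \mathrm{Prob}\left( u_i^2 > 1 + \frac{\epsilon}{q \sqrt{2 \bar{r} d}} \right) + \mathrm{Prob}\left( u_i^2 < 1 - \frac{\epsilon}{q \sqrt{2 \bar{r} d}} \right) \right] \\
&\le \bar{r} \left[ \exp\left( -\frac{(\tau - 1)^2}{4} \right) + \exp\left( -\frac{\epsilon^2}{8 \bar{r} d q^2} \right) \right],
\end{aligned}$$

where $\tau = \left( \frac{\sqrt{2}\, \epsilon}{q \sqrt{\bar{r} d}} + 1 \right)^{1/2}$. In the last step, we use the exponential tail bound of the chi-square variable 
in Lemma 7. 

Q.E.D. 

Remark. Note that if $u = v$ then $\mathrm{rank}(A_{uv}) = d$, otherwise $\mathrm{rank}(A_{uv}) = 2d$. Thus, we have that 
$\bar{r} \le 2d$, which could be used to eliminate the variable $\bar{r}$ in the tail bound. 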

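Proof of Theorem 2 

Proof. First, we have the following, 

$$\begin{aligned}
& \mathrm{Prob}\left( \nu \ge \gamma \mu \ \&\ \varsigma \le \epsilon \ \&\ \xi^\top G \xi \le \omega \mathbb{E}[\xi^\top G \xi] \right) \\
&\ge 1 - \mathrm{Prob}\left( \exists (i,j):\ \xi^\top \tilde{X}_{ij} \xi < \gamma \mu \right) - \mathrm{Prob}\left( \exists (u,v):\ |\xi^\top A_{uv} \xi - b_{uv}| > \epsilon \right) - \mathrm{Prob}\left( \xi^\top G \xi > \omega \mathbb{E}[\xi^\top G \xi] \right) \\
&\ge 1 - \sum_{(i,j) \in \mathcal{I}} \mathrm{Prob}\left( \xi^\top \tilde{X}_{ij} \xi < \gamma \mu \right) - \sum_{(u,v) \in \mathcal{C}} \mathrm{Prob}\left( |\xi^\top A_{uv} \xi - b_{uv}| > \epsilon \right) - \mathrm{Prob}\left( \xi^\top G \xi > \omega \mathbb{E}[\xi^\top G \xi] \right) \\
&= 1 - |\mathcal{I}| + \sum_{(i,j) \in \mathcal{I}} \mathrm{Prob}\left( \xi^\top \tilde{X}_{ij} \xi \ge \gamma \mu \right) - \sum_{(u,v) \in \mathcal{C}} \mathrm{Prob}\left( |\xi^\top A_{uv} \xi - b_{uv}| > \epsilon \right) - \mathrm{Prob}\left( \xi^\top G \xi > \omega \mathbb{E}[\xi^\top G \xi] \right).
\end{aligned}$$

Here we use the facts that 

$$\mathbb{E}[\xi^\top \tilde{X}_{ij} \xi] = \tilde{X}_{ij} \bullet \mathcal{Q}^* \ge \mu, \qquad \mathbb{E}[\xi^\top A_{uv} \xi] = A_{uv} \bullet \mathcal{Q}^* = b_{uv}.$$

We denote 

$$\begin{aligned}
T_1 &= \sum_{(i,j) \in \mathcal{I}} \mathrm{Prob}\left( \xi^\top \tilde{X}_{ij} \xi \ge \gamma \mu \right), \\
T_2 &= \sum_{(u,v) \in \mathcal{C}} \mathrm{Prob}\left( |\xi^\top A_{uv} \xi - b_{uv}| > \epsilon \right), \\
T_3 &= \mathrm{Prob}\left( \xi^\top G \xi > \omega \mathbb{E}[\xi^\top G \xi] \right),
\end{aligned}$$

and 

$$\begin{aligned}
r_1 &= \min\{\max_{(i,j)} \mathrm{rank}(\tilde{X}_{ij}), \mathrm{rank}(\mathcal{Q}^*)\}, \\
r_2 &= \min\{\max_{(u,v)} \mathrm{rank}(A_{uv}), \mathrm{rank}(\mathcal{Q}^*)\}, \\
r_3 &= \min\{\mathrm{rank}(G), \mathrm{rank}(\mathcal{Q}^*)\}.
\end{aligned}$$

According to the first constraints in (8) and the lemma of the left-side polynomial tail bound, we have 

$$T_1 \ge \sum_{(i,j) \in \mathcal{I}} \mathrm{Prob}\left( \xi^\top \tilde{X}_{ij} \xi \ge \gamma \mathbb{E}[\xi^\top \tilde{X}_{ij} \xi] \right) \ge |\mathcal{I}| \left( 1 - \max\left\{ \sqrt{\gamma}, \frac{2 (r_1 - 1) \gamma}{\pi - 2} \right\} \right).$$

Then, according to the second constraints in (8) and the lemma of the two-side exponential tail bound, we have 

$$T_2 = \sum_{(u,v) \in \mathcal{C}} \mathrm{Prob}\left( |\xi^\top A_{uv} \xi - \mathbb{E}[\xi^\top A_{uv} \xi]| > \epsilon \right) \le \frac{r_2 q (q + 1)}{2} \left[ \exp\left( -\frac{(\tau - 1)^2}{4} \right) + \exp\left( -\frac{\epsilon^2}{8 r_2 d q^2} \right) \right],$$

where $\tau = \left( \frac{\sqrt{2}\, \epsilon}{q \sqrt{r_2 d}} + 1 \right)^{1/2}$. Finally, according to the lemma of the right-side exponential tail bound, we 
have 

$$T_3 \le r_3 \exp\left( -\frac{1}{2} \left( \omega - \sqrt{2\omega - 1} \right) \right).$$

Therefore, based on all the above inequalities and the facts that $r_1 \le \bar{r}$, $r_2 \le \bar{r}$ and $r_3 \le \bar{r}$, we can derive 
that 

$$\begin{aligned}
\mathrm{Prob}\left( \nu \ge \gamma \mu \ \&\ \varsigma \le \epsilon \ \&\ \xi^\top G \xi \le \omega\, G \bullet \mathcal{Q}^* \right) \ge{} & 1 - |\mathcal{I}| \max\left\{ \sqrt{\gamma}, \frac{2 (\bar{r} - 1) \gamma}{\pi - 2} \right\} - \bar{r} \exp\left( -\frac{1}{2} \left( \omega - \sqrt{2\omega - 1} \right) \right) \\
& - \frac{\bar{r} q (q + 1)}{2} \left[ \exp\left( -\frac{(\tau - 1)^2}{4} \right) + \exp\left( -\frac{\epsilon^2}{8 \bar{r} d q^2} \right) \right],
\end{aligned}$$

where $\tau$ is now evaluated with $\bar{r}$ in place of $r_2$. 

Q.E.D. 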


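9.4 Proofs in Section 6 

Proof of Lemma 2 

Proof. The given loss function is $\ell(A_D, X_{ij}) = M \bullet X_{ij}$. First, note that 

$$M \bullet X_{ij} = (x_i - x_j)^\top M (x_i - x_j) = \frac{(x_i - x_j)^\top M (x_i - x_j)}{(x_i - x_j)^\top (x_i - x_j)} \, (x_i - x_j)^\top (x_i - x_j).$$

Then, relying on the property of the Rayleigh quotient, we obtain 

$$\lambda_{\min} \|x_i - x_j\|^2 \le M \bullet X_{ij} \le \lambda_{\max} \|x_i - x_j\|^2,$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of $M$. 

Thus, denoting the replace-one dataset and the corresponding metric as $D^k$ and $M^k$ respectively, 
we can derive that 

$$\begin{aligned}
\left| \ell(A_D, X_{ij}) - \ell(A_{D^k}, X_{ij}) \right| &= \left| M \bullet X_{ij} - M^k \bullet X_{ij} \right| \\
&\le \left( \lambda_{\max} + \lambda_{\min}^k \right) \|x_i - x_j\|^2 \\
&\le 4 \Gamma^2 \left( \lambda_{\max} + \lambda_{\min}^k \right) \\
&\le \frac{4 (K + 1) R \Gamma^2}{d},
\end{aligned}$$

where $\lambda_{\max}$ and $\lambda_{\min}^k$ are the maximum and minimum eigenvalues of $M$ and $M^k$ respectively. The 
second inequality uses the fact that $x_i$ and $x_j$ lie in a $\Gamma$-ball. The last one relies on the fact that, for 
both $M$ and $M^k$, 

$$\lambda_{\min} \le \frac{1}{d} \mathrm{Tr}(M) \le \frac{R}{d}, \qquad \lambda_{\max} \le K \lambda_{\min} \le \frac{K R}{d}.$$

Therefore, the Uniform-Replace-One stability is $\beta = \frac{4 (K + 1) R \Gamma^2}{d}$. 

Q.E.D. 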


Remark. Note that if our previous assumption is violated, i.e., $M$ is rank-deficient, then this stability 
result no longer holds. In particular, we will have, for any $M$, 

$$\lambda_{\max} \le \frac{K}{r} \mathrm{Tr}(M) \le \frac{K R}{r},$$

where $r$ is the rank of $M$ and $r < d$. Then, the stability could be rewritten as $\beta = \frac{4 (K + 1) R \Gamma^2}{\bar{r}}$, 
where $\bar{r} = \min\{\mathrm{rank}(M), \mathrm{rank}(M^k)\}$. This resultant stability is case-dependent and thus less 
favourable. This argument is important for our later explanation of why our generalization bound does 
not become tighter and tighter by trivially increasing the feature dimension. 

To prove Theorem 3, we first state one part of Lemma 9 in [5] as below. 

Lemma 6 (Variance Bound). For any algorithm $A$ and loss function $\ell$ such that $0 \le \ell \le L$, we 
have for any different $i, j \in \{1, \dots, n\}$, 

$$\mathbb{E}_D\left[ \left( \mathcal{R}(A, D) - \mathcal{R}_{emp}(A, D) \right)^2 \right] \le \frac{L^2}{2n} + 3 L \, \mathbb{E}_{D, z_i'}\left[ \left| \ell(A_D, z_i) - \ell(A_{D^i}, z_i) \right| \right].$$

Remark. Here $D$ and $D^i$ are defined in Sect. 6 and $D^i = \{D \setminus z_i \cup z_i'\}$. 

Proof of Theorem 3 


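Proof. First, for the given loss function $\ell$, we have that 

$$\ell(A, X_{ij}) = M \bullet X_{ij} \le \lambda_{\max} \|x_i - x_j\|^2 \le \frac{4 K R \Gamma^2}{d}.$$

Then, due to the definition of uniform-RO stability, we have that 

$$\mathbb{E}_{D, z_i'}\left[ \left| \ell(A_D, z_i) - \ell(A_{D^i}, z_i) \right| \right] \le \beta.$$

Therefore, according to the above lemma of the variance bound, we obtain 

$$\mathbb{E}_D\left[ \left( \mathcal{R}(A, D) - \mathcal{R}_{emp}(A, D) \right)^2 \right] \le \frac{8 K^2 R^2 \Gamma^4}{n d^2} + \frac{12 K R \Gamma^2 \beta}{d}.$$

Then, based on Chebyshev’s inequality, we can derive that 

$$\mathrm{Prob}\left( \mathcal{R}(A, D) - \mathcal{R}_{emp}(A, D) > \epsilon \right) \le \frac{\mathbb{E}_D\left[ \left( \mathcal{R}(A, D) - \mathcal{R}_{emp}(A, D) \right)^2 \right]}{\epsilon^2} \le \frac{1}{\epsilon^2} \left( \frac{8 K^2 R^2 \Gamma^4}{n d^2} + \frac{12 K R \Gamma^2 \beta}{d} \right).$$

Setting the right-hand side of the above inequality to $\delta$, we thus have with probability at least $1 - \delta$ 
that 

$$\mathcal{R}(A, D) \le \mathcal{R}_{emp}(A, D) + 2 \Gamma \sqrt{\frac{K R}{d \delta} \left( \frac{2 K R \Gamma^2}{n d} + 3 \beta \right)}.$$

By substituting the Uniform-Replace-One stability $\beta$, we obtain the following specific bound: 

$$\mathcal{R}(A, D) \le \mathcal{R}_{emp}(A, D) + \frac{2 R \Gamma^2}{d} \sqrt{\frac{2 K}{\delta} \left( \frac{K}{n} + 6 K + 6 \right)}.$$

Q.E.D. 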
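
To make the dependence of the bound on its parameters concrete, the sketch below evaluates the generalization gap in two ways and checks that they agree. It assumes the stability has the form $\beta = 4 (K + 1) R \Gamma^2 / d$ and that $M$ has full rank; the function names and the numeric values are hypothetical and serve only as an illustration, not as settings used in the experiments.

```python
import math


def replace_one_stability(K, R, Gamma, d):
    """Assumed uniform replace-one stability beta = 4 (K + 1) R Gamma^2 / d."""
    return 4.0 * (K + 1.0) * R * Gamma ** 2 / d


def gap_from_stability(K, R, Gamma, d, n, delta, beta):
    """Gap term 2 Gamma sqrt(K R / (d delta) * (2 K R Gamma^2 / (n d) + 3 beta))."""
    inner = 2.0 * K * R * Gamma ** 2 / (n * d) + 3.0 * beta
    return 2.0 * Gamma * math.sqrt(K * R / (d * delta) * inner)


def gap_closed_form(K, R, Gamma, d, n, delta):
    """Gap term (2 R Gamma^2 / d) sqrt((2 K / delta) (K / n + 6 K + 6))."""
    return 2.0 * R * Gamma ** 2 / d * math.sqrt(2.0 * K / delta * (K / n + 6.0 * K + 6.0))


# Hypothetical setting for illustration.
K, R, Gamma, d, n, delta = 100.0, 40.0, 1.0, 4, 105, 0.05
beta = replace_one_stability(K, R, Gamma, d)
gap_a = gap_from_stability(K, R, Gamma, d, n, delta, beta)
gap_b = gap_closed_form(K, R, Gamma, d, n, delta)
print(gap_a, gap_b)
```

Under these assumptions the gap grows roughly linearly in the distortion bound $K$ for fixed $R$, $\Gamma$ and $d$, which is the quantitative form of the claim that a bounded distortion helps generalization.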

Remark. As mentioned above, our generalization bound has one counter-intuitive property: 
it becomes tighter when the feature dimension $d$ increases. However, this is the case only if our 
previous full-rank assumption on $M$ holds. If this assumption is violated, e.g., in a sparse 
high-dimensional feature space, the above result no longer holds. This rules out the 
possibility that trivially increasing the feature dimension by adding zeros will improve the 
generalization ability. 

| K | Mean Cond. Num. | Mean Err. | Std Err.(±) |
|---|---|---|---|
| 1.0e+1 | 1.454 | 4.222 | 2.446 |
| 1.0e+2 | 2.165 | 3.333 | 2.160 |
| 1.0e+3 | 2.158 | 4.222 | 2.147 |
| 1.0e+4 | 37.567 | 5.333 | 2.859 |
| 1.0e+5 | 2962.094 | 6.222 | 3.443 |

Table 4: Performance varies with distortion bound K. 


| R | ρ | Mean Cond. Num. | Mean Err. | Std Err.(±) |
|---|---|---|---|---|
| 1.0e+2 | 1.0e+2 | 2.682 | 4.222 | 2.210 |
| 1.0e+2 | 5.0e+2 | 1.638 | 4.222 | 1.640 |
| 1.0e+2 | 1.0e+3 | 2.354 | 5.111 | 1.500 |
| 1.0e+3 | 1.0e+3 | 2.768 | 5.778 | 2.608 |
| 1.0e+3 | 5.0e+3 | 1.611 | 4.889 | 1.753 |
| 1.0e+3 | 1.0e+4 | 2.360 | 7.111 | 2.295 |

Table 5: Performance varies with trace bound R and width ρ. 


10 Impact of Parameters 


In this section, we study how the performance of our BDML algorithm varies with several important 
parameters, including the distortion bound K, the trace bound R and the width ρ. Except for the 
running time study, which needs large-scale data, all experiments are run on the UCI Iris dataset. 
As in Sect. 7.1, we randomly split the dataset into 70% for training and 30% for testing and report 
the average test error and its standard deviation over 10 random splits. 


10.1 Distortion Bound K 

We first study the effects of the distortion bound K. We fix the number of iterations T = 1000, the 
margin of p-BDML μ = 1, the trace bound R = 100 and the width ρ = 500. We report the 
mean condition number, the mean test error and the standard deviation of the test error. The results 
are listed in Table 4. From the table, we can see that as K increases, the resultant mean condition 
number generally becomes larger. This is partly because, as K increases, the bounded-distortion 
constraint becomes easier to satisfy, thus encouraging the MWU method to put more weight on other 
constraints such as the margin ones. Hence the learned metric embedding is more distorted in order to 
fit the training data. Moreover, as K increases, the test error first decreases and then increases, 
which matches our analysis in Sect. 6. 


10.2 Trace Bound R and Width p 


We now study the effects of the trace bound R and the width ρ. These two parameters are correlated in 
the sense that the width ρ should not be much smaller than the trace bound R, since otherwise the 
width constraint, i.e., $\forall i$, $|J_i \bullet Y - h_i| \le \rho$, will not hold. We fix the number of iterations 
T = 1000, the margin of p-BDML μ = 1 and the distortion bound K = 1000. The same measurements 
are reported in Table 5. 


From this table, we can see that if the width ρ is set much larger than the trace bound R, the 
resultant mean test error tends to become larger. This may be due to the fact that, with the same 
number of iterations, the larger the width, the lower the quality of the solution returned by the MWU 
method, which matches the analysis in [23]. On the other hand, if the width ρ is nearly equal to 
the trace bound R, the aforementioned width constraint will sometimes be violated. Therefore, 
in practice, we find that setting R ≈ 10d and ρ ≈ 5R yields good results. Here d is the dimension of 
the input feature. 

Figure 3: Running time of the MWU solver. 


| Methods | OrigFeat | SGF | GFK(PCA) | LMNN | DML-eig | f-BDML |
|---|---|---|---|---|---|---|
| A→C | 22.6(0.3) | 35.3(0.5) | 35.6(0.4) | 35.7(0.5) | 35.0(0.7) | 37.2(0.3) |
| A→D | 22.2(0.4) | 30.7(0.8) | 35.2(0.9) | 32.3(1.0) | 27.6(1.1) | 33.0(1.0) |
| A→W | 23.5(0.6) | 31.0(0.7) | 34.4(0.9) | 32.9(0.8) | 28.9(0.7) | 35.2(1.0) |
| C→A | 20.8(0.4) | 36.8(0.5) | 36.9(0.4) | 33.8(0.7) | 33.7(0.7) | 35.2(0.7) |
| C→D | 22.0(0.6) | 32.6(0.8) | 35.2(1.0) | 31.5(1.6) | 32.7(1.3) | 33.4(1.3) |
| C→W | 19.4(0.7) | 30.6(0.8) | 33.7(1.1) | 26.0(1.2) | 29.2(1.3) | 29.6(0.8) |
| D→A | 27.7(0.4) | 32.0(0.4) | 32.5(0.5) | 33.7(0.4) | 33.4(0.3) | 37.1(0.6) |
| D→C | 24.8(0.4) | 29.4(0.5) | 29.8(0.3) | 29.4(0.5) | 29.8(0.3) | 32.6(0.6) |
| D→W | 53.1(0.6) | 66.0(0.5) | 74.9(0.6) | 75.1(0.8) | 78.2(0.8) | 78.6(0.7) |
| W→A | 20.7(0.6) | 27.5(0.5) | 31.1(0.8) | 30.8(0.7) | 32.5(0.9) | 33.2(0.8) |
| W→C | 16.1(0.4) | 21.7(0.4) | 27.2(0.5) | 26.3(0.7) | 27.0(0.5) | 28.5(0.6) |
| W→D | 37.3(1.2) | 54.3(1.2) | 70.6(0.9) | 67.6(1.0) | 72.4(0.6) | 73.8(0.6) |


Table 6: Comparison of unsupervised domain adaptation. Mean accuracy (%) and standard error 
(inside parentheses) are reported. The best performance is denoted in bold type. 


10.3 Running Time 

We now investigate the running time on the LFW dataset, chosen for its high-dimensional features. 
Since the main component of our BDML algorithm is the MWU method, we study how its running 
time varies with the maximum number of iterations and the dimension of the input feature. The 
trace bound R and width ρ are both fixed as 3d + 1 as aforementioned. We implement the algorithm 
as a single-threaded Matlab program. All our experiments are conducted on a server with an Intel 
Xeon E5 CPU (2.6GHz) and 128G RAM. In particular, we test the following values of the dimension d 
of the input feature: 10, 50, 100 and 300. For each dimension, we set the maximum number of 
iterations to 100, 500, 1000, 2000 and 5000 and keep track of the corresponding running time. The 
natural logarithms of all results are plotted in Fig. 3. For example, for a 100-dimensional feature, 
it takes around 146s to finish 1000 iterations of MWU. 


11 Full Results of Experiments 

In this section, we present the full results of our experiments. 

| Methods | OrigFeat | ITML | SGF | GFK(PCA) | LMNN | DML-eig | MMDT | f-BDML |
|---|---|---|---|---|---|---|---|---|
| A→C | 24.0(0.3) | 27.3(0.7) | 37.7(0.5) | 37.8(0.4) | 36.6(0.6) | 27.8(0.7) | 36.4(0.8) | 38.8(0.3) |
| A→D | 28.1(0.6) | 33.7(0.9) | 34.5(1.1) | 47.0(1.2) | 43.8(1.2) | 33.0(0.9) | 56.7(1.3) | 46.5(0.9) |
| A→W | 31.6(0.6) | 36.0(1.0) | 37.9(0.7) | 53.7(0.8) | 49.6(0.9) | 40.5(1.0) | 64.6(1.2) | 55.8(1.1) |
| C→A | 23.1(0.4) | 33.7(0.8) | 40.2(0.7) | 42.0(0.5) | 43.3(0.5) | 43.3(0.6) | 49.4(0.8) | 44.6(0.6) |
| C→D | 26.5(0.7) | 35.0(1.1) | 36.6(0.8) | 49.5(0.9) | 50.3(1.3) | 45.1(1.6) | 56.5(0.9) | 54.0(1.1) |
| C→W | 25.2(0.8) | 34.7(1.0) | 37.2(0.9) | 54.2(0.9) | 56.2(1.5) | 58.9(1.2) | 63.8(1.1) | 56.0(1.0) |
| D→A | 31.3(0.7) | 30.3(0.8) | 39.2(0.7) | 45.0(0.7) | 42.0(0.7) | 43.4(0.6) | 46.9(1.0) | 43.9(0.6) |
| D→C | 22.4(0.5) | 22.5(0.6) | 30.2(0.7) | 32.7(0.4) | 33.4(0.4) | 31.9(0.4) | 34.1(0.8) | 35.4(0.3) |
| D→W | 55.5(0.7) | 55.6(0.7) | 69.5(0.9) | 78.7(0.5) | 78.6(0.7) | 80.5(0.9) | 74.1(0.8) | 83.8(0.5) |
| W→A | 30.8(0.6) | 32.3(0.8) | 38.2(0.6) | 42.8(0.7) | 42.3(0.6) | 40.8(0.7) | 47.7(0.9) | 44.8(0.6) |
| W→C | 20.8(0.5) | 21.7(0.5) | 29.2(0.7) | 32.8(0.7) | 32.2(0.7) | 32.8(0.6) | 32.2(0.8) | 33.3(0.6) |
| W→D | 44.3(1.0) | 51.3(0.9) | 60.6(1.0) | 75.0(0.7) | 72.8(1.1) | 76.8(0.9) | 67.0(1.1) | 79.2(0.7) |


Table 7: Comparison of semi-supervised domain adaptation. Mean accuracy (%) and standard error 
(inside parentheses) are reported. The best performance is denoted in bold type. 

Figure 4: ROC curves on the LFW dataset. 


11.1 Domain Adaptation 

For domain adaptation, we show the full results of all 12 possible combinations of source and target 
domains. In particular, unsupervised and semi-supervised experiments are listed in Table 6 and 
Table 7 respectively. We include the state-of-the-art results of max-margin domain transformations 
(MMDT) [13] on this dataset in Table 7. Note that the comparison with MMDT is somewhat unfair 
to our method, because MMDT exploits the discriminative power of a max-margin classifier, whereas 
ours is a simple 1-NN classifier based on distance metric learning. However, it is promising that, 
with such a simple classifier, our BDML still achieves state-of-the-art results in some subtasks. 

11.2 Face Verification 

In this section, we present full experimental comparisons on the LFW dataset. In Table 8, we list 
various published results on the “Image-Restricted, Label-Free Outside Data” setting of the LFW 
dataset. Specifically, the abbreviations of these algorithms are MERL [16], Xing [41], ITML Q, 
LDML [12], DML-eig-SIFT [42], Sub-ITML [6], KISSME [25], Sub-ML [6], LBP+CSML ED, 
DML-eig-Combined [42], Convolutional DBN [17], Sub-SML [6] and DDML-Combined [22]. Among them, 
“Convolutional DBN” and “DDML-Combined” are deep learning based methods. The suffix “Combined” 
means that the method uses multiple descriptors, e.g., SIFT [28], LBP [33], TPLBP [40], etc. 

| Methods | Mean Acc. | Std Err.(±) |
|---|---|---|
| MERL | 0.7052 | 0.0060 |
| Xing | 0.7593 | 0.0059 |
| ITML | 0.7812 | 0.0045 |
| LDML | 0.7927 | 0.0060 |
| DML-eig-SIFT | 0.8127 | 0.0230 |
| Sub-ITML | 0.8145 | 0.0046 |
| KISSME | 0.8308 | 0.0056 |
| Sub-ML | 0.8330 | 0.0026 |
| LBP+CSML | 0.8557 | 0.0052 |
| DML-eig-Combined | 0.8565 | 0.0056 |
| p-BDML | 0.8632 | 0.0022 |
| Convolutional DBN | 0.8777 | 0.0062 |
| Sub-SML | 0.8973 | 0.0038 |
| DDML-Combined | 0.9068 | 0.0141 |

Table 8: Full comparison on “Image-Restricted, Label-Free Outside Data” setting of LFW dataset. 

From this table, we can see that, although it uses only the dimension-reduced SIFT feature, our BDML 
algorithm achieves results comparable with other feature-combined and non-metric-learning based 
methods. Moreover, our method achieves the smallest standard error among all listed methods, which 
indicates that our BDML produces stable metrics. The ROC curves versus others are plotted in Fig. 4. 


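12 Useful Tail Bound for Chi-square Variables 

We list the following sharp tail bound for chi-square variables. 

Lemma 7 ([27]). Let $X \sim \chi^2_d$ and $\epsilon > 0$, then 

$$P\left( X - d \ge 2\sqrt{d \epsilon} + 2\epsilon \right) \le \exp(-\epsilon),$$
$$P\left( X - d \le -2\sqrt{d \epsilon} \right) \le \exp(-\epsilon).$$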
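
For reference, the following is a worked specialization of Lemma 7 to one degree of freedom, which is the case needed for squared standard normal variables in the proofs of Section 9.3. The symbols $\gamma$ and $\eta$ are our own shorthand for the thresholds, and the steps are a sketch of the substitution rather than a quotation of the original derivation.

```latex
\begin{align*}
&\text{Upper tail } (d = 1):\ \text{set } \gamma = 1 + 2\sqrt{\epsilon} + 2\epsilon
  \;\Rightarrow\; \sqrt{\epsilon} = \tfrac{1}{2}\bigl(\sqrt{2\gamma - 1} - 1\bigr),
  \quad \epsilon = \tfrac{1}{2}\bigl(\gamma - \sqrt{2\gamma - 1}\bigr), \\
&\qquad P(X \ge \gamma) \le \exp\Bigl(-\tfrac{1}{2}\bigl(\gamma - \sqrt{2\gamma - 1}\bigr)\Bigr),
  \qquad \gamma > 1. \\[4pt]
&\text{Equivalently, with } \gamma = 1 + \eta \text{ and } \tau = \sqrt{1 + 2\eta}:\quad
  \epsilon = \tfrac{1}{4}(\tau - 1)^2, \quad
  P(X \ge 1 + \eta) \le \exp\Bigl(-\tfrac{(\tau - 1)^2}{4}\Bigr). \\[4pt]
&\text{Lower tail } (d = 1):\ \text{set } \eta = 2\sqrt{\epsilon}
  \;\Rightarrow\; \epsilon = \tfrac{\eta^2}{4}, \quad
  P(X \le 1 - \eta) \le \exp\Bigl(-\tfrac{\eta^2}{4}\Bigr), \qquad \eta > 0.
\end{align*}
```

Taking $\eta = \epsilon / (q \sqrt{2 \bar{r} d})$ in the last two lines gives the two exponential terms that appear in the two-side tail bound.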
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